Generalization Bounds for Domain Adaptation 
Abstract

In this paper, we provide a new framework to obtain the generalization bounds of the
learning process for domain adaptation, and then apply the derived bounds to analyze the
asymptotic convergence of the learning process. Without loss of generality, we consider
two representative kinds of domain adaptation: one with multiple sources and the other
combining source and target data.

In particular, we use the integral probability metric to measure the difference between
two domains. For either kind of domain adaptation, we develop a related Hoeffding-type
deviation inequality and a symmetrization inequality to achieve the corresponding
generalization bound based on the uniform entropy number. We also generalize the classical
McDiarmid's inequality to a more general setting where independent random variables can
take values from different domains. By using this inequality, we then obtain
generalization bounds based on the Rademacher complexity. Afterwards, we analyze the asymptotic
convergence and the rate of convergence of the learning process for both kinds of domain
adaptation. Meanwhile, we discuss the factors that affect the asymptotic behavior of the
learning process, and numerical experiments support our theoretical findings.



1 Introduction 



The generalization bound measures the probability that a function, chosen from a function class
by an algorithm, has a sufficiently small error and plays an important role in statistical learning
theory (see Vapnik, 1999; Bousquet et al., 2004). The generalization bounds have been widely
used to study the consistency of the ERM-based learning process (Vapnik, 1999), the asymptotic
convergence of empirical processes (Van der Vaart and Wellner, 1996) and the learnability
of learning models (Blumer et al., 1989). Generally, there are three essential aspects to obtain
the generalization bounds of a specific learning process: complexity measures of function classes,
deviation (or concentration) inequalities and symmetrization inequalities related to the learning
process. For example, Van der Vaart and Wellner (1996) presented the generalization bounds
based on the Rademacher complexity and the covering number, respectively. Vapnik (1999) gave
the generalization bounds based on the Vapnik-Chervonenkis (VC) dimension. Bartlett et al.
(2005) proposed the local Rademacher complexity and obtained a sharp generalization bound
for a particular function class $\{f \in \mathcal{F} : \mathrm{E}f^2 \le \beta\,\mathrm{E}f,\ \beta > 0\}$. Hussain and Shawe-Taylor (2011)
showed improved loss bounds for multiple kernel learning.

It is noteworthy that the aforementioned results of statistical learning theory are all built
under the assumption that training and test data are drawn from the same distribution (or briefly
called the assumption of same distribution). This assumption may not be valid in the situation
that training and test data have different distributions, which will arise in many practical
applications including speech recognition (Jiang and Zhai, 2007) and natural language processing
(Blitzer et al., 2007). Domain adaptation has recently been proposed to handle this situation and
it is aimed to apply a learning model, trained by using the samples drawn from a certain domain
(source domain), to the samples drawn from another domain (target domain) with a different
distribution (see Bickel et al., 2007; Wu and Dietterich, 2004; Blitzer et al., 2006; Ben-David et al.,
2010; Bian et al., 2013).



Without loss of generality, this paper is mainly concerned with two types of representative
domain adaptation. In the first type, the learner receives training data from several source
domains, known as domain adaptation with multiple sources (see Crammer et al., 2006, 2008;
Mansour et al., 2008, 2009a). In the second type, the learner minimizes a convex combination
of the source and the target empirical risks, termed as domain adaptation combining source and
target data (see Ben-David et al., 2010; Blitzer et al., 2008).



1.1 Overview of Main Results 

In this paper, we present a new framework to obtain the generalization bounds of the learning
process for the aforementioned two kinds of representative domain adaptation, respectively. Based
on the resultant bounds, we then analyze the asymptotic properties of the learning processes
for the two types of domain adaptation. There are four major aspects in the framework:

• the quantity measuring the difference between two domains; 

• the complexity measure of function class; 

• the deviation inequalities of the learning process for domain adaptation; 

• the symmetrization inequality of the learning process for domain adaptation.

Generally, in order to obtain the generalization bounds of a learning process, one needs to
develop the related deviation (or concentration) inequalities of the learning process. For either
kind of domain adaptation, we use a martingale method to develop the related Hoeffding-type
deviation inequality. Moreover, in the situation of domain adaptation, since the source domain
differs from the target domain, the desired symmetrization inequality for domain adaptation
should incorporate some quantity to reflect the difference. From this point of view, we then
obtain the related symmetrization inequality incorporating the integral probability metric that
measures the difference between the distributions of the source and the target domains. Next, we
present the generalization bounds based on the uniform entropy number for both kinds of domain
adaptation. Also, we generalize the classical McDiarmid's inequality to a more general setting,
where independent random variables take values from different domains. By using the derived
inequality, we obtain the generalization bounds based on the Rademacher complexity. Following
the resultant bounds, we study the asymptotic convergence and the rate of convergence of the
learning process, and discuss the factors that affect the asymptotic behavior. The
numerical experiments support our theoretical findings as well. Meanwhile, we give a comparison
with the related results under the assumption of same distribution.

1.2 Organization of the Paper 

The rest of this paper is organized as follows. Section 2 introduces the problems studied in this
paper. Section 3 introduces the integral probability metric to measure the difference between
two domains. In Section 4, we introduce two kinds of complexity measures of function classes:
the uniform entropy number and the Rademacher complexity. In Section 5 (resp.
Section 6), we present the generalization bounds of the learning process for domain adaptation
with multiple sources (resp. combining source and target data), and then analyze the asymptotic
behavior of the learning process in addition to the related numerical experiment supporting our
findings. In Section 7, we list the existing works on the theoretical analysis of domain adaptation
as a comparison, and the last section concludes the paper. In the appendices, we prove the main
results of this paper. For clarity of presentation, we also postpone the discussion of the deviation
inequalities and the symmetrization inequalities to the appendices.

2 Problem Setup 

In this section, we formalize the main issues of this paper by introducing some necessary notation.

2.1 Domain Adaptation with Multiple Sources 

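We denote $\mathcal{Z}^{(S_k)} := \mathcal{X}^{(S_k)} \times \mathcal{Y}^{(S_k)} \subset \mathbb{R}^{I} \times \mathbb{R}^{J}$ ($1 \le k \le K$) and $\mathcal{Z}^{(T)} := \mathcal{X}^{(T)} \times \mathcal{Y}^{(T)} \subset \mathbb{R}^{I} \times \mathbb{R}^{J}$ as
the $k$-th source domain and the target domain, respectively. Set $L = I + J$. Let $\mathcal{D}^{(S_k)}$ and $\mathcal{D}^{(T)}$
stand for the distributions of the input spaces $\mathcal{X}^{(S_k)}$ ($1 \le k \le K$) and $\mathcal{X}^{(T)}$, respectively. Denote
$g_*^{(S_k)} : \mathcal{X}^{(S_k)} \to \mathcal{Y}^{(S_k)}$ and $g_*^{(T)} : \mathcal{X}^{(T)} \to \mathcal{Y}^{(T)}$ as the labeling functions of $\mathcal{Z}^{(S_k)}$ ($1 \le k \le K$) and
$\mathcal{Z}^{(T)}$, respectively. In the situation of domain adaptation with multiple sources, the input-space
distributions $\mathcal{D}^{(S_k)}$ ($1 \le k \le K$) and $\mathcal{D}^{(T)}$ differ from each other, or $g_*^{(S_k)}$ ($1 \le k \le K$) and $g_*^{(T)}$
differ from each other, or both of the cases occur. There are sufficient amounts of i.i.d. samples
$\mathbf{Z}_1^{N_k} = \{z_n^{(k)}\}_{n=1}^{N_k}$ drawn from each source domain $\mathcal{Z}^{(S_k)}$ ($1 \le k \le K$) but few or no labeled
samples drawn from the target domain $\mathcal{Z}^{(T)}$.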

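Given $w = (w_1, \cdots, w_K) \in [0,1]^K$ with $\sum_{k=1}^{K} w_k = 1$, let $g_w \in \mathcal{G}$ be the function that
minimizes the empirical risk

$$\mathrm{E}^{(S)}_{w}(\ell \circ g) = \sum_{k=1}^{K} w_k \mathrm{E}^{(S_k)}_{N_k}(\ell \circ g) = \sum_{k=1}^{K} \frac{w_k}{N_k} \sum_{n=1}^{N_k} \ell\big(g(x_n^{(k)}), y_n^{(k)}\big) \tag{1}$$

over $\mathcal{G}$ with respect to sample sets $\{\mathbf{Z}_1^{N_k}\}_{k=1}^{K}$, and it is expected that $g_w$ will perform well on the
target expected risk:

$$\mathrm{E}^{(T)}(\ell \circ g) := \int \ell\big(g(x^{(T)}), y^{(T)}\big)\, dP(z^{(T)}), \quad g \in \mathcal{G}, \tag{2}$$

i.e., $g_w$ approximates the labeling function $g_*^{(T)}$ as precisely as possible.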
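The weighted empirical risk in (1) can be made concrete with a small numerical sketch. The code below is only an illustration under stated assumptions: the function name `weighted_empirical_risk`, the squared loss, and the toy data are hypothetical choices introduced here, not part of the paper's experiments.

```python
import numpy as np

def weighted_empirical_risk(g, loss, sources, weights):
    """Weighted combination of per-source empirical risks, following (1).

    sources: list of (x_k, y_k) arrays, one pair per source domain.
    weights: nonnegative w_1, ..., w_K that sum to one.
    """
    weights = np.asarray(weights, dtype=float)
    if len(sources) != len(weights):
        raise ValueError("need exactly one weight per source")
    if np.any(weights < 0) or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be nonnegative and sum to 1")
    risk = 0.0
    for w_k, (x_k, y_k) in zip(weights, sources):
        # (1 / N_k) * sum_n loss(g(x_n), y_n) for the k-th source
        risk += w_k * float(np.mean(loss(g(x_k), y_k)))
    return risk

def squared_loss(pred, y):
    # Illustrative loss; the paper leaves the loss function generic.
    return (pred - y) ** 2

# Hypothetical toy data: two sources with N_1 = 2 and N_2 = 4 samples.
x1, y1 = np.array([0.0, 1.0]), np.array([0.0, 2.0])
x2, y2 = np.array([0.0, 1.0, 2.0, 3.0]), np.array([1.0, 1.0, 1.0, 1.0])
g = lambda x: x  # candidate predictor g(x) = x
risk = weighted_empirical_risk(g, squared_loss, [(x1, y1), (x2, y2)], [0.5, 0.5])
```

Each source contributes its own sample average, so a small source is not swamped by a large one unless the weights say so; this is the role of $w$ in the bounds that follow.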

In the learning process of domain adaptation with multiple sources, we are mainly interested
in the following two types of quantities:

• $\mathrm{E}^{(T)}(\ell \circ g_w) - \mathrm{E}^{(S)}_{w}(\ell \circ g_w)$, which corresponds to the estimation of the expected risk in the
target domain $\mathcal{Z}^{(T)}$ from the empirical quantity that is the weighted combination of the
empirical risks in the multiple sources $\{\mathcal{Z}^{(S_k)}\}_{k=1}^{K}$;

• $\mathrm{E}^{(T)}(\ell \circ g_w) - \mathrm{E}^{(T)}(\ell \circ \widetilde{g}_*)$, which corresponds to the performance of the algorithm for domain
adaptation with multiple sources,

where $\widetilde{g}_* \in \mathcal{G}$ is the function that minimizes the expected risk $\mathrm{E}^{(T)}(\ell \circ g)$ over $\mathcal{G}$.
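Recalling (1) and (2), since

$$\mathrm{E}^{(S)}_{w}(\ell \circ \widetilde{g}_*) - \mathrm{E}^{(S)}_{w}(\ell \circ g_w) \ge 0,$$

we have

$$\begin{aligned}
\mathrm{E}^{(T)}(\ell \circ g_w) &= \mathrm{E}^{(T)}(\ell \circ g_w) - \mathrm{E}^{(T)}(\ell \circ \widetilde{g}_*) + \mathrm{E}^{(T)}(\ell \circ \widetilde{g}_*) \\
&\le \mathrm{E}^{(S)}_{w}(\ell \circ \widetilde{g}_*) - \mathrm{E}^{(S)}_{w}(\ell \circ g_w) + \mathrm{E}^{(T)}(\ell \circ g_w) - \mathrm{E}^{(T)}(\ell \circ \widetilde{g}_*) + \mathrm{E}^{(T)}(\ell \circ \widetilde{g}_*) \\
&\le 2 \sup_{g \in \mathcal{G}} \big|\mathrm{E}^{(T)}(\ell \circ g) - \mathrm{E}^{(S)}_{w}(\ell \circ g)\big| + \mathrm{E}^{(T)}(\ell \circ \widetilde{g}_*),
\end{aligned} \tag{3}$$

and thus

$$0 \le \mathrm{E}^{(T)}(\ell \circ g_w) - \mathrm{E}^{(T)}(\ell \circ \widetilde{g}_*) \le 2 \sup_{g \in \mathcal{G}} \big|\mathrm{E}^{(T)}(\ell \circ g) - \mathrm{E}^{(S)}_{w}(\ell \circ g)\big|.$$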



This shows that the asymptotic behaviors of the aforementioned two quantities when the sample
numbers $N_1, \cdots, N_K$ go to infinity can both be described by the supremum

$$\sup_{g \in \mathcal{G}} \big|\mathrm{E}^{(T)}(\ell \circ g) - \mathrm{E}^{(S)}_{w}(\ell \circ g)\big|, \tag{4}$$

which is the so-called generalization bound of the learning process for domain adaptation with
multiple sources.

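For convenience, we define the loss function class

$$\mathcal{F} := \{z \mapsto \ell(g(x), y) : g \in \mathcal{G}\}, \tag{5}$$

and call $\mathcal{F}$ the function class in the rest of this paper. By (1) and (2), given sample sets
$\{\mathbf{Z}_1^{N_k}\}_{k=1}^{K}$ drawn from $\{\mathcal{Z}^{(S_k)}\}_{k=1}^{K}$ respectively, we briefly denote for any $f \in \mathcal{F}$,

$$\mathrm{E}^{(T)} f := \int f(z^{(T)})\, dP(z^{(T)}), \tag{6}$$

and

$$\mathrm{E}^{(S)}_{w} f := \sum_{k=1}^{K} \frac{w_k}{N_k} \sum_{n=1}^{N_k} f(z_n^{(k)}). \tag{7}$$

Thus, we rewrite the generalization bound (4) for domain adaptation with multiple sources as

$$\sup_{f \in \mathcal{F}} \big|\mathrm{E}^{(T)} f - \mathrm{E}^{(S)}_{w} f\big|. \tag{8}$$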



2.2 Domain Adaptation Combining Source and Target Data 



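Denote $\mathcal{Z}^{(S)} := \mathcal{X}^{(S)} \times \mathcal{Y}^{(S)} \subset \mathbb{R}^{I} \times \mathbb{R}^{J}$ and $\mathcal{Z}^{(T)} := \mathcal{X}^{(T)} \times \mathcal{Y}^{(T)} \subset \mathbb{R}^{I} \times \mathbb{R}^{J}$ as the source domain and
the target domain, respectively. Let $\mathcal{D}^{(S)}$ and $\mathcal{D}^{(T)}$ stand for the distributions of the input spaces
$\mathcal{X}^{(S)}$ and $\mathcal{X}^{(T)}$, respectively. Denote $g_*^{(S)} : \mathcal{X}^{(S)} \to \mathcal{Y}^{(S)}$ and $g_*^{(T)} : \mathcal{X}^{(T)} \to \mathcal{Y}^{(T)}$ as the labeling
functions of $\mathcal{Z}^{(S)}$ and $\mathcal{Z}^{(T)}$, respectively. In the situation of domain adaptation combining source
and target data (see Blitzer et al., 2008; Ben-David et al., 2010), the input-space distributions
$\mathcal{D}^{(S)}$ and $\mathcal{D}^{(T)}$ differ from each other, or the labeling functions $g_*^{(S)}$ and $g_*^{(T)}$ differ from each
other, or both cases occur. There are some (but not enough) samples $\mathbf{Z}_1^{N_T} = \{z_n^{(T)}\}_{n=1}^{N_T}$ drawn
from the target domain $\mathcal{Z}^{(T)}$ in addition to a large amount of samples $\mathbf{Z}_1^{N_S} = \{z_n^{(S)}\}_{n=1}^{N_S}$ drawn
from the source domain $\mathcal{Z}^{(S)}$ with $N_T \ll N_S$. Given a $\tau \in [0, 1)$, we denote $g_\tau \in \mathcal{G}$ as the
function that minimizes the convex combination of the source and the target empirical risks over
$\mathcal{G}$:

$$\mathrm{E}_{\tau}(\ell \circ g) := \tau\, \mathrm{E}^{(T)}_{N_T}(\ell \circ g) + (1 - \tau)\, \mathrm{E}^{(S)}_{N_S}(\ell \circ g), \tag{9}$$

and it is expected that $g_\tau$ will perform well for any pair $z^{(T)} = (x^{(T)}, y^{(T)}) \in \mathcal{Z}^{(T)}$, i.e., $g_\tau$
approximates the labeling function $g_*^{(T)}$ as precisely as possible.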

As mentioned by Blitzer et al. (2008) and Ben-David et al. (2010), setting $\tau$ involves a tradeoff
between the source data that are sufficient but not accurate and the target data that are accurate
but not sufficient. Especially, setting $\tau = 0$ provides a learning process of the basic domain
adaptation with one single source (see Ben-David et al., 2006).

Similar to the situation of domain adaptation with multiple sources, two types of quantities,
$\mathrm{E}^{(T)}(\ell \circ g_\tau) - \mathrm{E}_{\tau}(\ell \circ g_\tau)$ and $\mathrm{E}^{(T)}(\ell \circ g_\tau) - \mathrm{E}^{(T)}(\ell \circ \widetilde{g}_*)$, also play an essential role in analyzing the
asymptotic behavior of the learning process for domain adaptation combining source and target
data. In a similar way to (3), we need to consider the supremum

$$\sup_{g \in \mathcal{G}} \big|\mathrm{E}^{(T)}(\ell \circ g) - \mathrm{E}_{\tau}(\ell \circ g)\big|, \tag{10}$$

which is the so-called generalization bound of the learning process for domain adaptation
combining source and target data. Following the notation of (5) and taking $f = \ell \circ g$, we can
equivalently rewrite the generalization bound (10) as

$$\sup_{f \in \mathcal{F}} \big|\mathrm{E}^{(T)} f - \mathrm{E}_{\tau} f\big|. \tag{11}$$



3 Integral Probability Metric 



As shown in some existing works (see Mansour et al., 2008, 2009a; Ben-David et al., 2010, 2006),
one of the major challenges in the theoretical analysis of domain adaptation is to find a quantity to
measure the difference between the source domain $\mathcal{Z}^{(S)}$ and the target domain $\mathcal{Z}^{(T)}$. Then, one
can use the quantity to achieve generalization bounds for domain adaptation. In this section, we
use the integral probability metric to measure the difference between the distributions of $\mathcal{Z}^{(S)}$
and $\mathcal{Z}^{(T)}$, and then discuss the relationship between the integral probability metric and other
quantities proposed in existing works, e.g., the $\mathcal{H}$-divergence and the discrepancy distance (see
Ben-David et al., 2010; Mansour et al., 2009b). Moreover, we will show that there is a special
situation of domain adaptation, where the integral probability metric performs better than other
quantities (see Remark 3.1).



3.1 Integral Probability Metric 



In Ben-David et al. (2010, 2006), the $\mathcal{H}$-divergence was introduced to derive the generalization
bounds based on the VC dimension under the condition of "$\lambda$-close". Mansour et al. (2009b)
obtained the generalization bounds based on the Rademacher complexity by using the discrepancy
distance. Both quantities are aimed to measure the difference between two input-space
distributions $\mathcal{D}^{(S)}$ and $\mathcal{D}^{(T)}$. Moreover, Mansour et al. (2009a) used the Rényi divergence to measure
the distance between two distributions. In this paper, we use the following quantity to measure
the difference between the distributions of the source and the target domains:

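Definition 3.1 Given two domains $\mathcal{Z}^{(S)}, \mathcal{Z}^{(T)} \subset \mathbb{R}^{L}$, let $z^{(S)}$ and $z^{(T)}$ be the random variables
taking values from $\mathcal{Z}^{(S)}$ and $\mathcal{Z}^{(T)}$, respectively. Let $\mathcal{F} \subset \mathbb{R}^{\mathcal{Z}}$ be a function class. We define

$$D_{\mathcal{F}}(S, T) := \sup_{f \in \mathcal{F}} \big|\mathrm{E}^{(S)} f - \mathrm{E}^{(T)} f\big|, \tag{12}$$

where the expectations $\mathrm{E}^{(S)}$ and $\mathrm{E}^{(T)}$ are taken on the distributions of $\mathcal{Z}^{(S)}$ and $\mathcal{Z}^{(T)}$, respectively.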
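To make Definition 3.1 tangible, the following sketch computes a plug-in estimate of (12) in which the expectations are replaced by sample means and the supremum runs over a small finite function class. It is only an illustration: the helper `empirical_ipm`, the three test functions, and the toy samples are assumptions introduced here, and this is not the empirical estimator of Sriperumbudur et al.

```python
import numpy as np

def empirical_ipm(functions, z_source, z_target):
    """Plug-in estimate of sup_f |E^(S) f - E^(T) f| over a finite class.

    Each expectation is replaced by the mean of f over the given sample.
    """
    gaps = [
        abs(float(np.mean(f(z_source))) - float(np.mean(f(z_target))))
        for f in functions
    ]
    return max(gaps)

# Hypothetical finite function class and one-dimensional toy samples.
function_class = [lambda z: z, lambda z: z ** 2, lambda z: np.abs(z)]
z_s = np.array([-1.0, 0.0, 1.0])
z_t = np.array([0.0, 1.0, 2.0])

d_st = empirical_ipm(function_class, z_s, z_t)  # shifted samples: positive gap
d_ss = empirical_ipm(function_class, z_s, z_s)  # identical samples: zero gap
```

The second call mirrors the observation made below that the quantity vanishes when the two domains share the same distribution.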

The quantity $D_{\mathcal{F}}(S, T)$ is termed the integral probability metric, which has played an
important role in probability theory for measuring the difference between two probability
distributions (see Zolotarev, 1984; Rachev, 1991; Müller, 1997; Reid and Williamson, 2011).
Recently, Sriperumbudur et al. (2009, 2012) investigated it further and proposed an
empirical method to compute the integral probability metric. As mentioned by Müller (1997) [page
432], the quantity $D_{\mathcal{F}}(S, T)$ is a semimetric and it is a metric if and only if the function class
$\mathcal{F}$ separates the set of all signed measures with $\mu(\mathcal{Z}) = 0$. Namely, according to Definition 3.1,
given a non-trivial function class $\mathcal{F}$, the integral probability metric $D_{\mathcal{F}}(S, T)$ is equal to zero if
the domains $\mathcal{Z}^{(S)}$ and $\mathcal{Z}^{(T)}$ have the same distribution.

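By (5), the quantity $D_{\mathcal{F}}(S, T)$ can be equivalently rewritten as

$$\begin{aligned}
D_{\mathcal{F}}(S, T) &= \sup_{g \in \mathcal{G}} \Big|\mathrm{E}^{(S)} \ell\big(g(x^{(S)}), y^{(S)}\big) - \mathrm{E}^{(T)} \ell\big(g(x^{(T)}), y^{(T)}\big)\Big| \\
&= \sup_{g \in \mathcal{G}} \Big|\mathrm{E}^{(S)} \ell\big(g(x^{(S)}), g_*^{(S)}(x^{(S)})\big) - \mathrm{E}^{(T)} \ell\big(g(x^{(T)}), g_*^{(T)}(x^{(T)})\big)\Big|.
\end{aligned} \tag{13}$$

Next, based on the equivalent form (13), we discuss the relationships between the quantity
$D_{\mathcal{F}}(S, T)$ and other quantities including the $\mathcal{H}$-divergence and the discrepancy distance.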



3.2 Relationship with Other Quantities 

Before the formal discussion, we briefly introduce the related quantities proposed in existing
works (see Ben-David et al., 2010; Mansour et al., 2009b).



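3.2.1 $\mathcal{H}$-Divergence and Discrepancy Distance

In classification tasks, by setting $\ell$ as the absolute-value loss function ($\ell(x, y) = |x - y|$),
Ben-David et al. (2010) introduced a variant of the $\mathcal{H}$-divergence:

$$d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}^{(S)}, \mathcal{D}^{(T)}) = \sup_{g_1, g_2 \in \mathcal{H}} \Big|\mathrm{E}^{(S)} \ell\big(g_1(x^{(S)}), g_2(x^{(S)})\big) - \mathrm{E}^{(T)} \ell\big(g_1(x^{(T)}), g_2(x^{(T)})\big)\Big| \tag{14}$$

to achieve VC-dimension-based generalization bounds for domain adaptation under the condition
of "$\lambda$-close": there exists a $\lambda > 0$ such that

$$\lambda \ge \inf_{g \in \mathcal{G}} \left\{ \int \ell\big(g(x^{(S)}), y^{(S)}\big)\, dP(z^{(S)}) + \int \ell\big(g(x^{(T)}), y^{(T)}\big)\, dP(z^{(T)}) \right\}.$$

In both classification and regression tasks, given a function class $\mathcal{G}$ and a loss function
$\ell$, Mansour et al. (2009b) defined the discrepancy distance as

$$\mathrm{disc}_{\ell}(\mathcal{D}^{(S)}, \mathcal{D}^{(T)}) = \sup_{g_1, g_2 \in \mathcal{G}} \Big|\mathrm{E}^{(S)} \ell\big(g_1(x^{(S)}), g_2(x^{(S)})\big) - \mathrm{E}^{(T)} \ell\big(g_1(x^{(T)}), g_2(x^{(T)})\big)\Big| \tag{15}$$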



and then used this quantity to obtain the generalization bounds based on the Rademacher
complexity.

As mentioned by Mansour et al. (2009b), the quantities (14) and (15) match in the setting
of classification tasks by setting $\ell$ as the absolute-value loss function, while the usage of (15)
does not require the condition of "$\lambda$-close" but the usage of (14) does. Recalling Definition 3.1,
since there is no limitation on the function class $\mathcal{F}$, the integral probability metric $D_{\mathcal{F}}(S, T)$ can
be used in both classification and regression tasks. Therefore, we only consider the relationship
between the integral probability metric $D_{\mathcal{F}}(S, T)$ and the discrepancy distance $\mathrm{disc}_{\ell}(\mathcal{D}^{(S)}, \mathcal{D}^{(T)})$.



3.2.2 Relationship between $D_{\mathcal{F}}(S, T)$ and $\mathrm{disc}_{\ell}(\mathcal{D}^{(S)}, \mathcal{D}^{(T)})$



From Definition 3.1 and (13), we can find that the integral probability metric $D_{\mathcal{F}}(S, T)$ measures
the difference between the distributions of the two domains $\mathcal{Z}^{(S)}$ and $\mathcal{Z}^{(T)}$. However, as addressed
in Section 2, if a domain $\mathcal{Z}^{(S)}$ differs from another domain $\mathcal{Z}^{(T)}$, there are three possibilities:
the input-space distribution $\mathcal{D}^{(S)}$ differs from $\mathcal{D}^{(T)}$, or $g_*^{(S)}$ differs from $g_*^{(T)}$, or both of them
occur. Therefore, it is necessary to consider two kinds of differences: the difference between the
input-space distributions $\mathcal{D}^{(S)}$ and $\mathcal{D}^{(T)}$ and the difference between the labeling functions $g_*^{(S)}$
and $g_*^{(T)}$. Next, we will show that the integral probability metric $D_{\mathcal{F}}(S, T)$ can be bounded by
using two separate quantities that can measure the difference between $\mathcal{D}^{(S)}$ and $\mathcal{D}^{(T)}$ and the
difference between $g_*^{(S)}$ and $g_*^{(T)}$, respectively.

As shown in (15), the quantity $\mathrm{disc}_{\ell}(\mathcal{D}^{(S)}, \mathcal{D}^{(T)})$ actually measures the difference between the
input-space distributions $\mathcal{D}^{(S)}$ and $\mathcal{D}^{(T)}$. Moreover, we introduce another quantity to measure
the difference between the labeling functions $g_*^{(S)}$ and $g_*^{(T)}$.

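Definition 3.2 Given a loss function $\ell$ and a function class $\mathcal{G}$, we define

$$Q^{(T)}_{\mathcal{G}}\big(g_*^{(S)}, g_*^{(T)}\big) := \sup_{g_1 \in \mathcal{G}} \Big|\mathrm{E}^{(T)} \ell\big(g_1(x^{(T)}), g_*^{(T)}(x^{(T)})\big) - \mathrm{E}^{(T)} \ell\big(g_1(x^{(T)}), g_*^{(S)}(x^{(T)})\big)\Big|. \tag{16}$$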



Note that if the loss function $\ell$ and the function class $\mathcal{G}$ are both non-trivial (i.e., $\mathcal{F}$ is
non-trivial), the quantity $Q^{(T)}_{\mathcal{G}}(g_*^{(S)}, g_*^{(T)})$ is a (semi)metric between the labeling functions $g_*^{(S)}$ and
$g_*^{(T)}$. In fact, it is not hard to verify that $Q^{(T)}_{\mathcal{G}}(g_*^{(S)}, g_*^{(T)})$ satisfies the triangle inequality and is
equal to zero if $g_*^{(S)}$ and $g_*^{(T)}$ match.



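By combining (13), (15) and (16), we have

$$\begin{aligned}
\mathrm{disc}_{\ell}(\mathcal{D}^{(S)}, \mathcal{D}^{(T)}) &= \sup_{g_1, g_2 \in \mathcal{G}} \Big|\mathrm{E}^{(S)} \ell\big(g_1(x^{(S)}), g_2(x^{(S)})\big) - \mathrm{E}^{(T)} \ell\big(g_1(x^{(T)}), g_2(x^{(T)})\big)\Big| \\
&\ge \sup_{g_1 \in \mathcal{G}} \Big|\mathrm{E}^{(S)} \ell\big(g_1(x^{(S)}), g_*^{(S)}(x^{(S)})\big) - \mathrm{E}^{(T)} \ell\big(g_1(x^{(T)}), g_*^{(S)}(x^{(T)})\big)\Big| \\
&= \sup_{g_1 \in \mathcal{G}} \Big|\mathrm{E}^{(S)} \ell\big(g_1(x^{(S)}), g_*^{(S)}(x^{(S)})\big) - \mathrm{E}^{(T)} \ell\big(g_1(x^{(T)}), g_*^{(T)}(x^{(T)})\big) \\
&\qquad\qquad + \mathrm{E}^{(T)} \ell\big(g_1(x^{(T)}), g_*^{(T)}(x^{(T)})\big) - \mathrm{E}^{(T)} \ell\big(g_1(x^{(T)}), g_*^{(S)}(x^{(T)})\big)\Big| \\
&\ge \sup_{g_1 \in \mathcal{G}} \Big|\mathrm{E}^{(S)} \ell\big(g_1(x^{(S)}), g_*^{(S)}(x^{(S)})\big) - \mathrm{E}^{(T)} \ell\big(g_1(x^{(T)}), g_*^{(T)}(x^{(T)})\big)\Big| \\
&\qquad - \sup_{g_1 \in \mathcal{G}} \Big|\mathrm{E}^{(T)} \ell\big(g_1(x^{(T)}), g_*^{(T)}(x^{(T)})\big) - \mathrm{E}^{(T)} \ell\big(g_1(x^{(T)}), g_*^{(S)}(x^{(T)})\big)\Big| \\
&= D_{\mathcal{F}}(S, T) - Q^{(T)}_{\mathcal{G}}\big(g_*^{(S)}, g_*^{(T)}\big),
\end{aligned}$$

and thus

$$D_{\mathcal{F}}(S, T) \le \mathrm{disc}_{\ell}(\mathcal{D}^{(S)}, \mathcal{D}^{(T)}) + Q^{(T)}_{\mathcal{G}}\big(g_*^{(S)}, g_*^{(T)}\big), \tag{17}$$

which implies that the integral probability metric $D_{\mathcal{F}}(S, T)$ can be bounded by the summation
of the discrepancy distance $\mathrm{disc}_{\ell}(\mathcal{D}^{(S)}, \mathcal{D}^{(T)})$ and the quantity $Q^{(T)}_{\mathcal{G}}(g_*^{(S)}, g_*^{(T)})$, which measure the
difference between the input-space distributions $\mathcal{D}^{(S)}$ and $\mathcal{D}^{(T)}$ and the difference between the
labeling functions $g_*^{(S)}$ and $g_*^{(T)}$, respectively.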
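The step in the chain leading to (17) that separates one supremum into a difference of two rests on an elementary fact that the text leaves implicit. A short derivation is sketched below; the shorthand $a(g_1)$ and $b(g_1)$ is introduced here purely for illustration and does not appear in the original. Note also that the first inequality of the chain, which fixes $g_2 = g_*^{(S)}$, appears to presume that $g_*^{(S)}$ belongs to $\mathcal{G}$.

```latex
% Shorthand (an assumption of this sketch, not notation from the paper):
%   a(g_1) = E^{(S)} \ell(g_1(x^{(S)}), g_*^{(S)}(x^{(S)}))
%          - E^{(T)} \ell(g_1(x^{(T)}), g_*^{(T)}(x^{(T)}))
%   b(g_1) = E^{(T)} \ell(g_1(x^{(T)}), g_*^{(T)}(x^{(T)}))
%          - E^{(T)} \ell(g_1(x^{(T)}), g_*^{(S)}(x^{(T)}))
\begin{align*}
|a(g_1)| &\le |a(g_1) + b(g_1)| + |b(g_1)|
  && \text{(triangle inequality, for each fixed } g_1 \in \mathcal{G}) \\
&\le \sup_{g \in \mathcal{G}} |a(g) + b(g)| + \sup_{g \in \mathcal{G}} |b(g)| , \\
\text{hence}\quad
\sup_{g_1 \in \mathcal{G}} |a(g_1)| - \sup_{g_1 \in \mathcal{G}} |b(g_1)|
  &\le \sup_{g_1 \in \mathcal{G}} |a(g_1) + b(g_1)| .
\end{align*}
```

By (13) the first supremum on the left is $D_{\mathcal{F}}(S, T)$, and by (16) the second is $Q^{(T)}_{\mathcal{G}}(g_*^{(S)}, g_*^{(T)})$, which gives the last two lines of the chain.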

Remark 3.1 Note that there is a specific case in the situation of domain adaptation: $\mathcal{D}^{(S)}$ differs
from $\mathcal{D}^{(T)}$ and meanwhile $g_*^{(S)}$ differs from $g_*^{(T)}$, while the distribution of the domain $\mathcal{Z}^{(S)}$ matches
that of the domain $\mathcal{Z}^{(T)}$. In this case, the integral probability metric $D_{\mathcal{F}}(S, T)$ equals zero,
but neither $\mathrm{disc}_{\ell}(\mathcal{D}^{(S)}, \mathcal{D}^{(T)})$ nor $Q^{(T)}_{\mathcal{G}}(g_*^{(S)}, g_*^{(T)})$ equals zero. Therefore, the integral probability
metric $D_{\mathcal{F}}(S, T)$ is more suitable for this case than the discrepancy distance $\mathrm{disc}_{\ell}(\mathcal{D}^{(S)}, \mathcal{D}^{(T)})$.



4 Complexity Measures of Function Classes 

Generally, the generalization bound of a certain learning process is achieved by incorporating 
some complexity measure of the function class, e.g., the covering number, the VC dimension and 
the Rademacher complexity. In this paper, we are mainly concerned with the uniform entropy 
number and the Rademacher complexity. 



4.1 Uniform Entropy Number 

The uniform entropy number is derived from the concept of the covering number and we refer to
Mendelson (2003) for details. The covering number of a function class $\mathcal{F}$ is defined as follows:

Definition 4.1 Let $\mathcal{F}$ be a function class and $d$ be a metric on $\mathcal{F}$. For any $\xi > 0$, the covering
number of $\mathcal{F}$ at radius $\xi$ with respect to the metric $d$, denoted by $\mathcal{N}(\mathcal{F}, \xi, d)$, is the minimum size
of a cover of radius $\xi$.
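As a rough illustration of Definition 4.1, the sketch below builds a cover greedily for a finite function class represented by its values on a sample, using the empirical $\ell_1$ distance as the metric $d$. The helper `greedy_cover_size` and the toy values are assumptions introduced here; a greedy cover only gives an upper bound on the covering number, not necessarily the minimum itself.

```python
import numpy as np

def greedy_cover_size(values, radius):
    """Size of a greedily built cover of the rows of `values`.

    Row i holds (f_i(z_1), ..., f_i(z_N)); the metric is the empirical
    l1 distance d(f, h) = (1/N) * sum_n |f(z_n) - h(z_n)|.
    The result upper-bounds the covering number; it need not equal it.
    """
    values = np.asarray(values, dtype=float)
    uncovered = np.ones(len(values), dtype=bool)
    size = 0
    while uncovered.any():
        # Use the first still-uncovered function as the next center.
        center = values[np.argmax(uncovered)]
        dist = np.mean(np.abs(values - center), axis=1)
        # Everything within `radius` of the center is now covered.
        uncovered &= dist > radius
        size += 1
    return size

# Hypothetical class of three functions evaluated on two sample points.
vals = np.array([[0.0, 0.0], [0.0, 0.1], [1.0, 1.0]])
fine = greedy_cover_size(vals, 0.1)    # two well-separated groups
coarse = greedy_cover_size(vals, 1.0)  # one ball covers everything
```

The cover size shrinks as the radius grows, which is the monotonicity that entropy-based bounds exploit when $\xi$ is traded off against the sample size.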



In some classical results of statistical learning theory, the covering number is applied by letting
$d$ be the distribution-dependent metric. For example, as shown in Theorem 2.3 of Mendelson
(2003), one can set $d$ as the norm $\ell_1(\mathbf{Z}_1^{N})$ and then derive the generalization bound of the i.i.d.
learning process by incorporating the expectation of the covering number, i.e., $\mathrm{E}\mathcal{N}(\mathcal{F}, \xi, \ell_1(\mathbf{Z}_1^{N}))$.
However, in the situation of domain adaptation, we only know the information of the source
domain, while the expectation $\mathrm{E}\mathcal{N}(\mathcal{F}, \xi, \ell_1(\mathbf{Z}_1^{N}))$ is dependent on the distributions of the source
and the target domains because $z = (x, y)$. Therefore, the covering number is no longer suitable
for our framework to obtain the generalization bounds for domain adaptation. In contrast, the
uniform entropy number is distribution-free and thus we choose it as the complexity measure of
function classes to derive the generalization bounds for domain adaptation.

Next, we will consider the uniform entropy number of $\mathcal{F}$ in the situations of two types
of domain adaptation: (i) domain adaptation with multiple sources; (ii) domain adaptation
combining source and target data, respectively.



4.1.1 Domain Adaptation with Multiple Sources 



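For clarity of presentation, we give a useful notation for the following discussion. Let $\{\mathbf{Z}_1^{N_k}\}_{k=1}^{K} :=
\{\{z_n^{(k)}\}_{n=1}^{N_k}\}_{k=1}^{K}$ be the collection of sample sets drawn from multiple sources $\{\mathcal{Z}^{(S_k)}\}_{k=1}^{K}$,
respectively. Denote $\{\mathbf{Z}'^{N_k}_1\}_{k=1}^{K} := \{\{z'^{(k)}_n\}_{n=1}^{N_k}\}_{k=1}^{K}$ as the collection of the ghost sample sets drawn from
$\{\mathcal{Z}^{(S_k)}\}_{k=1}^{K}$ such that the ghost sample $z'^{(k)}_n$ has the same distribution as $z_n^{(k)}$ for any $1 \le k \le K$
and any $1 \le n \le N_k$. Denote $\mathbf{Z}_1^{2N_k} := \{\mathbf{Z}_1^{N_k}, \mathbf{Z}'^{N_k}_1\}$ for any $1 \le k \le K$. Moreover, given an
$f \in \mathcal{F}$ and a $w = (w_1, \cdots, w_K) \in [0, 1]^K$ with $\sum_{k=1}^{K} w_k = 1$, we introduce a variant of the $\ell_1$
norm:

$$\|f\|_{\ell_1^{w}(\{\mathbf{Z}_1^{2N_k}\}_{k=1}^{K})} := \sum_{k=1}^{K} \frac{w_k}{2N_k} \sum_{n=1}^{N_k} \Big(\big|f(z_n^{(k)})\big| + \big|f(z'^{(k)}_n)\big|\Big). \tag{19}$$

It is noteworthy that the variant $\ell_1^{w}$ of the $\ell_1$ norm is still a norm on the functional space, which
can be directly verified by using the definition of norm, so we omit it here.

In the situation of domain adaptation with multiple sources, by setting the metric $d$ as
$\ell_1^{w}(\{\mathbf{Z}_1^{2N_k}\}_{k=1}^{K})$, we then define the uniform entropy number of $\mathcal{F}$ with respect to the metric
$\ell_1^{w}(\{\mathbf{Z}_1^{2N_k}\}_{k=1}^{K})$ as

$$\ln \mathcal{N}_1^{w}\Big(\mathcal{F}, \xi, 2\sum_{k=1}^{K} N_k\Big) := \sup_{\{\mathbf{Z}_1^{2N_k}\}_{k=1}^{K}} \ln \mathcal{N}\Big(\mathcal{F}, \xi, \ell_1^{w}(\{\mathbf{Z}_1^{2N_k}\}_{k=1}^{K})\Big). \tag{20}$$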

4.1.2 Domain Adaptation Combining Source and Target Data 

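In the situation of domain adaptation combining source and target data, we have to introduce
another variant of the $\ell_1$ norm on $\mathcal{F}$. Let $\mathbf{Z}_1^{N_S} = \{z_n^{(S)}\}_{n=1}^{N_S}$ and $\mathbf{Z}_1^{N_T} = \{z_n^{(T)}\}_{n=1}^{N_T}$ be two sets of
samples drawn from the domains $\mathcal{Z}^{(S)}$ and $\mathcal{Z}^{(T)}$, respectively. Given an $f \in \mathcal{F}$, we define for any
$\tau \in [0, 1)$,

$$\|f\|_{\ell_1^{\tau}(\mathbf{Z}_1^{N_S}, \mathbf{Z}_1^{N_T})} := \frac{\tau}{N_T} \sum_{n=1}^{N_T} \big|f(z_n^{(T)})\big| + \frac{1 - \tau}{N_S} \sum_{n=1}^{N_S} \big|f(z_n^{(S)})\big|. \tag{21}$$

Note that the variant $\ell_1^{\tau}$ ($\tau \in [0, 1)$) of the norm $\ell_1$ is still a norm on the functional space, which
can be easily verified by using the definition of norm, so we omit it here.

Moreover, let $\mathbf{Z}'^{N_S}_1$ and $\mathbf{Z}'^{N_T}_1$ be the ghost sample sets of $\mathbf{Z}_1^{N_S}$ and $\mathbf{Z}_1^{N_T}$, respectively. Denote
$\mathbf{Z}_1^{2N_S} := \{\mathbf{Z}_1^{N_S}, \mathbf{Z}'^{N_S}_1\}$ and $\mathbf{Z}_1^{2N_T} := \{\mathbf{Z}_1^{N_T}, \mathbf{Z}'^{N_T}_1\}$, respectively. Then, the uniform entropy number
of $\mathcal{F}$ with respect to the metric $\ell_1^{\tau}(\mathbf{Z})$ is defined as

$$\ln \mathcal{N}_1^{\tau}\big(\mathcal{F}, \xi, 2(N_S + N_T)\big) := \sup_{\mathbf{Z}} \ln \mathcal{N}\big(\mathcal{F}, \xi, \ell_1^{\tau}(\mathbf{Z})\big), \tag{22}$$

where $\mathbf{Z} := \{\mathbf{Z}_1^{2N_S}, \mathbf{Z}_1^{2N_T}\}$.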

4.2 Rademacher Complexity 

The Rademacher complexity is one of the most frequently used complexity measures of function
classes and we refer to Van der Vaart and Wellner (1996) and Mendelson (2003) for details.



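Definition 4.2 Let $\mathcal{F}$ be a function class and $\{z_n\}_{n=1}^{N}$ be a sample set drawn from $\mathcal{Z}$. Let
$\{\sigma_n\}_{n=1}^{N}$ be a set of random variables independently taking either value from $\{-1, 1\}$ with equal
probability. The Rademacher complexity of $\mathcal{F}$ is defined as

$$\mathcal{R}(\mathcal{F}) := \mathrm{E} \sup_{f \in \mathcal{F}} \left\{ \frac{1}{N} \sum_{n=1}^{N} \sigma_n f(z_n) \right\} \tag{23}$$

with its empirical version

$$\mathcal{R}_N(\mathcal{F}) := \mathrm{E}_{\sigma} \sup_{f \in \mathcal{F}} \left\{ \frac{1}{N} \sum_{n=1}^{N} \sigma_n f(z_n) \right\}, \tag{24}$$

where $\mathrm{E}$ stands for the expectation taken with respect to all random variables $\{z_n\}_{n=1}^{N}$ and
$\{\sigma_n\}_{n=1}^{N}$, and $\mathrm{E}_{\sigma}$ stands for the expectation only taken with respect to the random variables
$\{\sigma_n\}_{n=1}^{N}$.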
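The empirical Rademacher complexity (24) can be evaluated exactly for a finite function class on a very small sample by enumerating every sign vector. The sketch below does this under stated assumptions: the helper `empirical_rademacher` and the toy classes are hypothetical, and the supremum is taken without an absolute value, matching the form of (24) used here.

```python
import itertools
import numpy as np

def empirical_rademacher(values):
    """Exact empirical Rademacher complexity of a finite class, as in (24).

    Row i of `values` holds (f_i(z_1), ..., f_i(z_N)). All 2^N sign
    vectors are enumerated, so this is only practical for small N.
    """
    values = np.asarray(values, dtype=float)
    n = values.shape[1]
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        # sup over the class of (1/N) * sum_n sigma_n * f(z_n)
        total += np.max(values @ np.array(signs)) / n
    return total / 2 ** n

# Hypothetical classes on N = 2 sample points.
single = empirical_rademacher([[1.0, 1.0]])             # one function
pair = empirical_rademacher([[1.0, 1.0], [-1.0, -1.0]])  # f and -f
```

A class with a single function cannot correlate with random signs on average, whereas adding $-f$ lets the supremum always pick the better-aligned member; this is the sense in which the quantity measures the richness of $\mathcal{F}$.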

5 Learning Processes of Domain Adaptation with Multiple Sources

In this section, we present two generalization bounds of the learning process for domain adapta- 
tion with multiple sources. They are based on the uniform entropy number and the Rademacher 
complexity, respectively. By using the derived bounds based on the uniform entropy number, 
we then analyze the asymptotic convergence and the rate of convergence of the learning process. 
The numerical experiment supports our theoretical analysis as well. 

5.1 Generalization Bounds 

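Based on the uniform entropy number defined in (20), a generalization bound for domain
adaptation with multiple sources is presented in the following theorem.

Theorem 5.1 Assume that $\mathcal{F}$ is a function class consisting of bounded functions with the range
$[a, b]$. Let $w = (w_1, \cdots, w_K) \in [0, 1]^K$ with $\sum_{k=1}^{K} w_k = 1$. Then, given an arbitrary $\xi >
D^{(w)}_{\mathcal{F}}(S, T)$, we have for any $\prod_{k=1}^{K} N_k \ge \frac{8(b-a)^2}{(\xi')^2}$ and any $\epsilon > 0$, with probability at least $1 - \epsilon$,

$$\sup_{f \in \mathcal{F}} \big|\mathrm{E}^{(T)} f - \mathrm{E}^{(S)}_{w} f\big| \le D^{(w)}_{\mathcal{F}}(S, T) + \left( \frac{\ln \mathcal{N}_1^{w}\big(\mathcal{F}, \xi'/8, 2\sum_{k=1}^{K} N_k\big) - \ln(\epsilon/8)}{\dfrac{\prod_{k=1}^{K} N_k}{32(b-a)^2 \big(\sum_{k=1}^{K} w_k^2 (\prod_{i \ne k} N_i)\big)}} \right)^{\frac{1}{2}}, \tag{25}$$

where $\xi' = \xi - D^{(w)}_{\mathcal{F}}(S, T)$ and

$$D^{(w)}_{\mathcal{F}}(S, T) := \sum_{k=1}^{K} w_k D_{\mathcal{F}}(S_k, T). \tag{26}$$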

In the above theorem, we show that the generalization bound $\sup_{f \in \mathcal{F}} |\mathrm{E}^{(T)} f - \mathrm{E}^{(S)}_{w} f|$ can be
bounded by the right-hand side of (25). Compared to the classical result under the assumption
of same distribution (see Mendelson, 2003, Theorem 2.3 and Definition 2.5): with probability at
least $1 - \epsilon$,

$$\sup_{f \in \mathcal{F}} \big|\mathrm{E}_N f - \mathrm{E} f\big| \le O\left( \left( \frac{\ln \mathcal{N}_1(\mathcal{F}, \xi, N) - \ln(\epsilon/8)}{N} \right)^{\frac{1}{2}} \right) \tag{27}$$

with $\mathrm{E}_N f$ being the empirical risk with respect to the sample set $\mathbf{Z}_1^{N}$, there is a discrepancy
quantity $D^{(w)}_{\mathcal{F}}(S, T)$ that is determined by two factors: the choice of $w$ and the integral
probability metrics $D_{\mathcal{F}}(S_k, T)$ ($1 \le k \le K$). The two results will coincide if every source domain
matches the target domain, i.e., $D_{\mathcal{F}}(S_k, T) = 0$ holds for any $1 \le k \le K$.

In order to prove this result, we develop the specific Hoeffding-type deviation inequality and
the symmetrization inequality for domain adaptation with multiple sources, respectively. The
detailed proof is arranged in Appendix B. Subsequently, we give another generalization bound
based on the Rademacher complexity:

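Theorem 5.2 Assume that $\mathcal{F}$ is a function class consisting of bounded functions with the range
$[a, b]$. Let $w = (w_1, \cdots, w_K) \in [0, 1]^K$ with $\sum_{k=1}^{K} w_k = 1$. Then, we have with probability at least
$1 - \epsilon$,

$$\sup_{f \in \mathcal{F}} \big|\mathrm{E}^{(S)}_{w} f - \mathrm{E}^{(T)} f\big| \le D^{(w)}_{\mathcal{F}}(S, T) + 2 \sum_{k=1}^{K} w_k \mathcal{R}^{(k)}(\mathcal{F}) + \sqrt{\sum_{k=1}^{K} \frac{(b-a)^2 w_k^2 \ln(1/\epsilon)}{2 N_k}}, \tag{28}$$

where $D^{(w)}_{\mathcal{F}}(S, T)$ is defined in (26) and $\mathcal{R}^{(k)}(\mathcal{F})$ ($1 \le k \le K$) are the Rademacher complexities
on the source domains $\mathcal{Z}^{(S_k)}$, respectively.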

Similarly, the derived bound (28) coincides with the related classical result under the
assumption of same distribution (see Bousquet et al., 2004, Theorem 5), when every source domain of
$\{\mathcal{Z}^{(S_k)}\}_{k=1}^{K}$ matches the target domain $\mathcal{Z}^{(T)}$, i.e., $D^{(w)}_{\mathcal{F}}(S, T) = D_{\mathcal{F}}(S_k, T) = 0$ holds for any
$1 \le k \le K$. The proof of this theorem proceeds by introducing a generalized version of
McDiarmid's inequality which allows independent random variables to take values from different
domains (see Appendix C).

Subsequently, based on the derived bound (25), we can analyze the asymptotic behavior of
the learning process for domain adaptation with multiple sources.

5.2 Asymptotic Convergence 

In statistical learning theory, it is well known that the complexity of the function class is one of the main
factors affecting the asymptotic convergence of the learning process under the assumption of same
distribution (Vapnik, 1999; Van der Vaart and Wellner, 1996; Mendelson, 2003).

From Theorem 5.1, we can directly arrive at the following result showing that the asymptotic
convergence of the learning process for domain adaptation with multiple sources is affected by
three aspects: the choice of $w$, the discrepancy quantity $D^{(w)}_{\mathcal{F}}(S, T)$ and the uniform entropy
number $\ln \mathcal{N}_1^{w}(\mathcal{F}, \xi'/8, 2\sum_{k=1}^{K} N_k)$.

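Theorem 5.3 Assume that $\mathcal{F}$ is a function class consisting of bounded functions with the
range $[a, b]$. Let $w = (w_1, \cdots, w_K) \in [0, 1]^K$ with $\sum_{k=1}^{K} w_k = 1$. If the following condition holds:

$$\lim_{N_1, \cdots, N_K \to +\infty} \frac{\ln \mathcal{N}_1^{w}\big(\mathcal{F}, \xi'/8, 2\sum_{k=1}^{K} N_k\big)}{\dfrac{\prod_{k=1}^{K} N_k}{32(b-a)^2 \big(\sum_{k=1}^{K} w_k^2 (\prod_{i \ne k} N_i)\big)}} < +\infty \tag{29}$$

with $\xi' = \xi - D^{(w)}_{\mathcal{F}}(S, T)$, then we have for any $\xi > D^{(w)}_{\mathcal{F}}(S, T)$,

$$\lim_{N_1, \cdots, N_K \to +\infty} \Pr\Big\{ \sup_{f \in \mathcal{F}} \big|\mathrm{E}^{(T)} f - \mathrm{E}^{(S)}_{w} f\big| > \xi \Big\} = 0. \tag{30}$$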

As shown in Theorem 5.3, if the choice of $w \in [0, 1]^K$ and the uniform entropy number
$\ln \mathcal{N}_1^{w}(\mathcal{F}, \xi'/8, 2\sum_{k=1}^{K} N_k)$ satisfy the condition (29) with $\sum_{k=1}^{K} w_k = 1$, the probability of the
event that $\sup_{f \in \mathcal{F}} |\mathrm{E}^{(T)} f - \mathrm{E}^{(S)}_{w} f| > \xi$ will converge to zero for any $\xi > D^{(w)}_{\mathcal{F}}(S, T)$, when the
sample numbers $N_1, \cdots, N_K$ of multiple sources go to infinity, respectively. This is partially in
accordance with the classical result of the asymptotic convergence of the learning process under
the assumption of same distribution (cf. Theorem 2.3 and Definition 2.5 of Mendelson, 2003): the probability
of the event that $\sup_{f \in \mathcal{F}} |\mathrm{E} f - \mathrm{E}_N f| > \xi$ will converge to zero for any $\xi > 0$, if the uniform
entropy number $\ln \mathcal{N}_1(\mathcal{F}, \xi, N)$ satisfies the following:

$$\lim_{N \to +\infty} \frac{\ln \mathcal{N}_1(\mathcal{F}, \xi, N)}{N} < +\infty. \tag{31}$$

Note that in the learning process of domain adaptation with multiple sources, the uniform
convergence of the empirical risk on the source domains to the expected risk on the target domain
may not hold, because the limit (30) does not hold for any $\xi > 0$ but for any $\xi > D^{(w)}_{\mathcal{F}}(S, T)$.
By contrast, the limit (30) holds for all $\xi > 0$ in the learning process under the assumption of
same distribution, if the condition (31) is satisfied. Again, these two results coincide when every
source domain matches the target domain, i.e., $D^{(w)}_{\mathcal{F}}(S, T) = D_{\mathcal{F}}(S_k, T) = 0$ holds for any
$1 \le k \le K$.

Next, we study the rate of convergence of the learning process for domain adaptation with 
multiple sources. 
5.3 Rate of Convergence

Recalling (25), we can see that the rate of convergence is affected by the choice of $w$. According
to the Cauchy-Schwarz inequality, setting $w_k = N_k / \sum_{k=1}^{K} N_k$ ($1 \le k \le K$), we have

$$\frac{\prod_{k=1}^{K}N_k}{32(b-a)^2\big(\sum_{k=1}^{K}w_k^2\prod_{i\neq k}N_i\big)} = \frac{N_1+N_2+\cdots+N_K}{32(b-a)^2}, \tag{32}$$

which minimizes the second term of the right-hand side of (25). Thus, by (25), (27) and (32),
we find that the fastest rate of convergence of the learning process is up to $O(1/\sqrt{N})$ with
$N = N_1 + \cdots + N_K$, which is the same as the classical result (27) for the learning process under
the assumption of same distribution if the discrepancy $D^{(w)}_{\mathcal{F}}(S,T)$ is ignored.
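The effect of $w$ on the rate can be checked numerically. The following is a minimal sketch, not part of the original analysis: the sample sizes and the two alternative weightings are hypothetical, and `effective_size` is an illustrative helper that evaluates $\prod_k N_k / \sum_k w_k^2 \prod_{i\neq k} N_i$, which simplifies to $1/\sum_k (w_k^2/N_k)$.

```python
from math import prod

def effective_size(weights, sizes):
    """Return prod(N_k) / sum_k (w_k^2 * prod_{i != k} N_i).

    This is the factor that multiplies 1 / (32 (b - a)^2) in the
    denominator of the second term of the bound; larger is better.
    """
    total = prod(sizes)
    denom = sum(w * w * (total / n) for w, n in zip(weights, sizes))
    return total / denom

sizes = [200, 600, 1200]                      # hypothetical N_1, N_2, N_3
n_total = sum(sizes)

proportional = [n / n_total for n in sizes]   # w_k = N_k / sum_k N_k
uniform = [1.0 / len(sizes)] * len(sizes)     # equal weights
skewed = [0.8, 0.1, 0.1]                      # most weight on the smallest source

for name, w in [("proportional", proportional),
                ("uniform", uniform),
                ("skewed", skewed)]:
    print(f"{name:>12}: {effective_size(w, sizes):.1f}")
```

Under these assumptions the proportional weighting attains the value $N_1 + N_2 + N_3$, while the other two weightings give strictly smaller values, in line with (32).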

In addition, the bound (28) based on the Rademacher complexity also implies that the rate of
convergence of the learning process is affected by the choice of $w$. Again, according to the
Cauchy-Schwarz inequality, setting $w_k = N_k / \sum_{k=1}^{K} N_k$ ($1 \le k \le K$) leads to the fastest rate of convergence:



'(»-'PMiA) =0(1/ ^ )t 

which is in accordance with the aforementioned analysis. The following numerical experiments
support our theoretical findings (see Fig. 1).

5.4 Numerical Experiment 

We have performed numerical experiments to verify the theoretical analysis of the asymptotic
convergence of the learning process for domain adaptation with multiple sources. Without loss
of generality, we only consider the case of $K = 2$, i.e., there are two source domains and one
target domain. The experimental data are generated in the following way.

For the target domain $\mathcal{Z}^{(T)} = \mathcal{X}^{(T)} \times \mathcal{Y}^{(T)} \subset \mathbb{R}^{100} \times \mathbb{R}$, we consider $\mathcal{X}^{(T)}$ as a Gaussian
distribution $N(0,1)$ and draw $\{x_n^{(T)}\}_{n=1}^{N_T}$ ($N_T = 4000$) from $\mathcal{X}^{(T)}$ randomly and independently.
Let $\beta \in \mathbb{R}^{100}$ be a random vector of a Gaussian distribution $N(1,5)$, and let the random vector
$R \in \mathbb{R}^{100}$ be a noise term with $R \sim N(0, 0.5)$. For any $1 \le n \le N_T$, we randomly draw $\beta$ and $R$
from $N(1,5)$ and $N(0,0.01)$ respectively, and then generate $y_n^{(T)} \in \mathcal{Y}$ as follows:

$$y_n^{(T)} = \langle x_n^{(T)}, \beta\rangle + R. \tag{33}$$

The derived $\{(x_n^{(T)}, y_n^{(T)})\}_{n=1}^{N_T}$ ($N_T = 4000$) are the samples of the target domain $\mathcal{Z}^{(T)}$ and will
be used as the test data.

In a similar way, we generate the sample set $\{(x_n^{(1)}, y_n^{(1)})\}_{n=1}^{N_1}$ ($N_1 = 2000$) of the source
domain $\mathcal{Z}^{(1)} = \mathcal{X}^{(1)} \times \mathcal{Y}^{(1)} \subset \mathbb{R}^{100} \times \mathbb{R}$: for any $1 \le n \le N_1$,

$$y_n^{(1)} = \langle x_n^{(1)}, \beta\rangle + R, \tag{34}$$

where $x_n^{(1)} \sim N(0.5, 1)$, $\beta \sim N(1,5)$ and $R \sim N(0, 0.5)$.

For the source domain $\mathcal{Z}^{(2)} = \mathcal{X}^{(2)} \times \mathcal{Y}^{(2)} \subset \mathbb{R}^{100} \times \mathbb{R}$, the samples $\{(x_n^{(2)}, y_n^{(2)})\}_{n=1}^{N_2}$ ($N_2 =
2000$) are derived in the following way: for any $1 \le n \le N_2$,

$$y_n^{(2)} = \langle x_n^{(2)}, \beta\rangle + R, \tag{35}$$

where $x_n^{(2)} \sim N(2, 5)$, $\beta \sim N(1,5)$ and $R \sim N(0, 0.5)$.
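The data-generation procedure above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the second parameter of $N(\cdot,\cdot)$ is read as a variance, a single $\beta$ is shared by all domains, the noise $R$ is taken as a scalar with variance 0.5, and the helper names `gaussian_vector` and `make_domain` are hypothetical.

```python
import random

DIM = 100  # input dimension used in the experiment

def gaussian_vector(rng, mean, variance, dim=DIM):
    """Draw a dim-dimensional vector with i.i.d. N(mean, variance) entries."""
    std = variance ** 0.5
    return [rng.gauss(mean, std) for _ in range(dim)]

def make_domain(rng, beta, x_mean, x_var, noise_var, n_samples):
    """Generate pairs (x_n, y_n) with y_n = <x_n, beta> + R, R ~ N(0, noise_var)."""
    samples = []
    for _ in range(n_samples):
        x = gaussian_vector(rng, x_mean, x_var)
        noise = rng.gauss(0.0, noise_var ** 0.5)
        y = sum(xi * bi for xi, bi in zip(x, beta)) + noise
        samples.append((x, y))
    return samples

rng = random.Random(0)                                    # fixed seed for repeatability
beta = gaussian_vector(rng, 1.0, 5.0)                     # assumption: one shared beta
target = make_domain(rng, beta, 0.0, 1.0, 0.5, 4000)      # target domain, as in (33)
source1 = make_domain(rng, beta, 0.5, 1.0, 0.5, 2000)     # first source, as in (34)
source2 = make_domain(rng, beta, 2.0, 5.0, 0.5, 2000)     # second source, as in (35)
```

The three sample sets differ only in the input distribution, so any gap between source and target risks in this sketch comes from the shift in $x$.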

In this experiment, we use the method of Least Square Regression to minimize the empirical
risk

$$\mathrm{E}^{(S)}_{w}(\ell\circ g) = \frac{w}{N_1}\sum_{n=1}^{N_1}\ell\big(g(x_n^{(1)}),y_n^{(1)}\big) + \frac{1-w}{N_2}\sum_{n=1}^{N_2}\ell\big(g(x_n^{(2)}),y_n^{(2)}\big) \tag{36}$$

for different combination coefficients $w \in \{0.1, 0.3, 0.5, 0.9\}$ and then compute the discrepancy
$|\mathrm{E}^{(S)}_{w}f - \mathrm{E}^{(T)}_{N_T}f|$ for each $N_1 + N_2$. Initially, $N_1$ and $N_2$ both equal 200. Each test is repeated
30 times and the final result is the average of the 30 results. After each test, we increment $N_1$
and $N_2$ by 200 until $N_1 = N_2 = 2000$. The experimental results are shown in Fig. 1.
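For concreteness, the weighted empirical risk (36) can be evaluated as in the sketch below. It assumes a linear predictor $g(x) = \langle g, x\rangle$ and the squared loss, which is consistent with least-squares regression but is not spelled out in the text; the function names and the toy data are hypothetical.

```python
def squared_loss(prediction, label):
    """Squared loss l(g(x), y) = (g(x) - y)^2."""
    return (prediction - label) ** 2

def predict(g, x):
    """Linear predictor g(x) = <g, x>."""
    return sum(gi * xi for gi, xi in zip(g, x))

def weighted_empirical_risk(g, source1, source2, w):
    """Weighted risk: w * mean loss on source 1 + (1 - w) * mean loss on source 2."""
    risk1 = sum(squared_loss(predict(g, x), y) for x, y in source1) / len(source1)
    risk2 = sum(squared_loss(predict(g, x), y) for x, y in source2) / len(source2)
    return w * risk1 + (1.0 - w) * risk2

# Toy data: g fits source 1 exactly and is off by 1 on the single source-2 sample.
toy_source1 = [([1.0, 0.0], 2.0), ([0.0, 1.0], 3.0)]
toy_source2 = [([1.0, 1.0], 6.0)]
g = [2.0, 3.0]
print(weighted_empirical_risk(g, toy_source1, toy_source2, w=0.3))
```

With $w = 0.3$ the toy value is $0.3 \cdot 0 + 0.7 \cdot 1$, which makes the role of the combination coefficient explicit.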




Figure 1: Domain Adaptation with Multiple Sources



From Fig. 1, we can see that for any choice of $w$, the curve of $|\mathrm{E}^{(S)}_{w}f - \mathrm{E}^{(T)}_{N_T}f|$ is decreasing
as $N_1 + N_2$ increases, which is in accordance with the results presented in Theorems 5.1 & 5.3.
Moreover, when $w = 0.5$, the discrepancy $|\mathrm{E}^{(S)}_{w}f - \mathrm{E}^{(T)}_{N_T}f|$ has the fastest rate of convergence,
and the rate becomes slower as $w$ moves further away from 0.5. In this experiment, we set $N_1 = N_2$,
which implies that $N_2/(N_1 + N_2) = 0.5$. Recalling (25), we have shown that $w = N_2/(N_1 + N_2)$
provides the fastest rate of convergence, and this proposition is supported by the experimental
results shown in Fig. 1.



6 Learning Process of Domain Adaptation Combining 
Source and Target Data 

In this section, we present two generalization bounds of the learning process for domain
adaptation combining source and target data, which are based on the uniform entropy number and
the Rademacher complexity, respectively. We then analyze the asymptotic convergence and the
rate of convergence of the learning process, and present numerical experiments supporting
our theoretical analysis.

1 SLEP Package: http://www.public.asu.edu/~jye02/Software/SLEP/index.htm

6.1 Generalization Bounds 

The following theorem provides a generalization bound based on the uniform entropy number
with respect to the metric $\ell_1^{\tau}$ defined in (22). Similar to the situation of domain adaptation
with multiple sources, the proof of this theorem is achieved by using a specific Hoeffding-type
deviation inequality and a symmetrization inequality for domain adaptation combining source
and target data (see Appendix B).

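Theorem 6.1 Assume that $\mathcal{F}$ is a function class consisting of bounded functions with the
range $[a,b]$. Let $Z_1^{N_S} = \{z_n^{(S)}\}_{n=1}^{N_S}$ and $Z_1^{N_T} = \{z_n^{(T)}\}_{n=1}^{N_T}$ be two sets of i.i.d. samples drawn
from the domains $\mathcal{Z}^{(S)}$ and $\mathcal{Z}^{(T)}$, respectively. Then, for any $\tau \in [0,1)$ and given an arbitrary
$\xi > (1-\tau)D_{\mathcal{F}}(S,T)$, we have for any $N_S N_T \ge \frac{8(b-a)^2}{(\xi')^2}$, with probability at least $1-\epsilon$,

$$\sup_{f\in\mathcal{F}}\big|\mathrm{E}_{\tau}f-\mathrm{E}^{(T)}f\big| \le (1-\tau)D_{\mathcal{F}}(S,T) + \left(\frac{\ln \mathcal{N}^{\tau}_1\big(\mathcal{F},\xi'/8,2(N_S+N_T)\big)-\ln(\epsilon/8)}{\dfrac{N_SN_T}{32(b-a)^2\big((1-\tau)^2N_T+\tau^2N_S\big)}}\right)^{\frac{1}{2}}, \tag{37}$$

where $D_{\mathcal{F}}(S,T)$ is defined in (12) and $\xi' := \xi - (1-\tau)D_{\mathcal{F}}(S,T)$.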

Compared to the classical result (27) under the assumption of same distribution, the derived
bound (37) contains a discrepancy term $(1-\tau)D_{\mathcal{F}}(S,T)$ that is determined by two
factors: the combination coefficient $\tau$ and the quantity $D_{\mathcal{F}}(S,T)$. The two results coincide when
the source domain $\mathcal{Z}^{(S)}$ and the target domain $\mathcal{Z}^{(T)}$ match, i.e., $D_{\mathcal{F}}(S,T) = 0$.

Based on the Rademacher complexity, we then get another generalization bound of the
learning process for domain adaptation combining source and target data. Its proof is postponed to
Appendix C.

Theorem 6.2 Assume that $\mathcal{F}$ is a function class consisting of bounded functions with the
range $[a,b]$. Let $Z_1^{N_S} = \{z_n^{(S)}\}_{n=1}^{N_S}$ and $Z_1^{N_T} = \{z_n^{(T)}\}_{n=1}^{N_T}$ be two sets of i.i.d. samples drawn from
the domains $\mathcal{Z}^{(S)}$ and $\mathcal{Z}^{(T)}$, respectively. Then, given $\tau \in [0,1)$ and for any $\epsilon > 0$, we have with
probability at least $1-\epsilon$,

sup \E T f - E^f\ <(1 - t)D t (S, T) + 2(1 - r)llW(F) 
feT 



+ 2r^ (r)+3 rJ^-"^^ 



^ t V-;t-M m 



+(1 _ tW (^mha) ^ + fi-£,, (3f 



where $D_{\mathcal{F}}(S,T)$ is defined in (12).

Note that in the derived bound (38), we adopt an empirical Rademacher complexity $\mathcal{R}^{(T)}_{N_T}(\mathcal{F})$
that is based on the data drawn from the target domain $\mathcal{Z}^{(T)}$, because the distribution of $\mathcal{Z}^{(T)}$
is unknown in the situation of domain adaptation. Similar to the aforementioned discussion, the
generalization bound (38) coincides with the result under the assumption of same distribution
(see Bousquet et al., 2004, Theorem 5), when the source domain $\mathcal{Z}^{(S)}$ and the target domain
$\mathcal{Z}^{(T)}$ match, i.e., $D_{\mathcal{F}}(S,T) = 0$.

The two results (37) and (38) exhibit a tradeoff between the sample numbers $N_S$ and $N_T$,
which is associated with the choice of $\tau$. Although the tradeoff has been mentioned in some
previous works (see Blitzer et al., 2008; Ben-David et al., 2010), the following will show a rigorous
theoretical analysis of the tradeoff.

6.2 Asymptotic Convergence 



Following Theorem 6.1, we can directly obtain the following result, which points out that the
asymptotic convergence of the learning process for domain adaptation combining source and target data
is affected by three factors: the uniform entropy number $\ln \mathcal{N}^{\tau}_1(\mathcal{F}, \xi'/8, 2(N_S + N_T))$, the integral
probability metric $D_{\mathcal{F}}(S,T)$ and the choice of $\tau \in [0,1)$.

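Theorem 6.3 Assume that $\mathcal{F}$ is a function class consisting of bounded functions with the range
$[a,b]$. Given $\tau \in [0,1)$, if the following condition holds:

$$\lim_{N_S\to+\infty} \frac{\ln \mathcal{N}^{\tau}_1\big(\mathcal{F},\xi'/8,2(N_S+N_T)\big)}{\dfrac{N_SN_T}{(1-\tau)^2N_T+\tau^2N_S}} < +\infty \tag{39}$$

with $\xi' := \xi - (1-\tau)D_{\mathcal{F}}(S,T)$, then we have for any $\xi > (1-\tau)D_{\mathcal{F}}(S,T)$,

$$\lim_{N_S\to+\infty} \Pr\Big\{\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(T)}f-\mathrm{E}_{\tau}f\big| > \xi\Big\} = 0. \tag{40}$$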



As shown in Theorem 6.3, if the choice of $\tau \in [0,1)$ and the uniform entropy number
$\ln \mathcal{N}^{\tau}_1(\mathcal{F}, \xi'/8, 2(N_S + N_T))$ satisfy the condition (39), the probability of the event
$\sup_{f\in\mathcal{F}} |\mathrm{E}^{(T)}f - \mathrm{E}_{\tau}f| > \xi$ converges to zero for any $\xi > (1-\tau)D_{\mathcal{F}}(S,T)$ as $N_S$ goes to infinity. This
is partially in accordance with the classical result under the assumption of same distribution
derived from the combination of Theorem 2.3 and Definition 2.5 of Mendelson (2003).

Note that in the learning process for domain adaptation combining source and target data, the
uniform convergence of the empirical risk $\mathrm{E}_{\tau}f$ to the expected risk $\mathrm{E}^{(T)}f$ may not hold, because
the limit (40) does not hold for every $\xi > 0$ but only for $\xi > (1-\tau)D_{\mathcal{F}}(S,T)$. By contrast, the
limit (40) holds for all $\xi > 0$ in the learning process under the assumption of same distribution,
if the condition (31) is satisfied. The two results coincide when the source domain $\mathcal{Z}^{(S)}$ and the
target domain $\mathcal{Z}^{(T)}$ match, i.e., $D_{\mathcal{F}}(S,T) = 0$.

6.3 Rate of Convergence 

We consider the choice of $\tau$, which is an essential factor for the rate of convergence of the learning
process and is associated with the tradeoff between the sample numbers $N_S$ and $N_T$. Recalling
(37), if we fix the value of $\ln \mathcal{N}^{\tau}_1(\mathcal{F}, \xi'/8, 2(N_S + N_T))$, setting $\tau = \frac{N_T}{N_T+N_S}$ minimizes the second
term of the right-hand side of (37) and then we arrive at

$$\sup_{f\in\mathcal{F}}\big|\mathrm{E}_{\tau}f-\mathrm{E}^{(T)}f\big| \le \frac{N_S D_{\mathcal{F}}(S,T)}{N_S+N_T} + \left(\frac{\ln \mathcal{N}^{\tau}_1\big(\mathcal{F},\xi'/8,2(N_S+N_T)\big)-\ln(\epsilon/8)}{\dfrac{N_S+N_T}{32(b-a)^2}}\right)^{\frac{1}{2}}, \tag{41}$$

which implies that setting $\tau = \frac{N_T}{N_T+N_S}$ can result in the fastest rate of convergence, while it can
also cause a relatively larger discrepancy between the empirical risk $\mathrm{E}_{\tau}f$ and the expected risk
$\mathrm{E}^{(T)}f$, because the situation of domain adaptation is set up under the condition $N_T \ll N_S$,
which implies that $\frac{N_S}{N_T+N_S} \approx 1$. Moreover, this choice of $\tau$, associated with the tradeoff between
the sample numbers $N_S$ and $N_T$, is also suitable for the Rademacher-complexity-based bound (38).
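The tradeoff described above can be made concrete with a small numerical sketch. The sample sizes below are hypothetical, and `rate_factor` and `discrepancy_weight` are illustrative names: the former is the quantity $N_SN_T/((1-\tau)^2N_T+\tau^2N_S)$ that controls the second term of (37), the latter is the coefficient $(1-\tau)$ in front of $D_{\mathcal{F}}(S,T)$.

```python
def rate_factor(tau, n_s, n_t):
    """N_S N_T / ((1 - tau)^2 N_T + tau^2 N_S); larger gives a smaller second term."""
    return n_s * n_t / ((1.0 - tau) ** 2 * n_t + tau ** 2 * n_s)

def discrepancy_weight(tau):
    """Coefficient (1 - tau) multiplying the discrepancy D_F(S, T)."""
    return 1.0 - tau

n_s, n_t = 4000, 100                 # hypothetical sizes with N_T much smaller than N_S
tau_star = n_t / (n_s + n_t)         # the choice tau = N_T / (N_T + N_S)

for tau in (tau_star, 0.1, 0.3, 0.5, 0.9):
    print(f"tau={tau:.3f}  rate factor={rate_factor(tau, n_s, n_t):8.1f}  "
          f"discrepancy weight={discrepancy_weight(tau):.3f}")
```

In this sketch the rate factor peaks at `tau_star`, where it equals $N_S + N_T$, while the discrepancy weight at that point is close to 1, which is the tension the paragraph above describes.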



It is noteworthy that the value $\tau = \frac{N_T}{N_T+N_S}$ has been mentioned in the section of "Experimental
Results" in Blitzer et al. (2008). Here, we show a rigorous theoretical analysis of this value, and
the following numerical experiment also supports this finding (see Fig. 2).

6.4 Numerical Experiments 

In the situation of domain adaptation combining source and target data, the samples $\{(x_n^{(T)}, y_n^{(T)})\}_{n=1}^{N_T}$
($N_T = 4000$) of the target domain $\mathcal{Z}^{(T)}$ are generated in the aforementioned way (see (33)).
We randomly pick $N'_T = 100$ samples from them to form the objective function and the remaining
$N''_T = 3900$ are used for testing.

In a similar way, the samples $\{(x_n^{(S)}, y_n^{(S)})\}_{n=1}^{N_S}$ ($N_S = 4000$) of the source domain $\mathcal{Z}^{(S)}$ are
generated as follows: for any $1 \le n \le N_S$,

$$y_n^{(S)} = \langle x_n^{(S)}, \beta\rangle + R, \tag{42}$$

where $x_n^{(S)} \sim N(1, 2)$, $\beta \sim N(1, 5)$ and $R \sim N(0, 0.5)$.

We also use the method of Least Square Regression to minimize the empirical risk

$$\mathrm{E}_{\tau}(\ell\circ g) = \frac{1-\tau}{N_S}\sum_{n=1}^{N_S}\ell\big(g(x_n^{(S)}),y_n^{(S)}\big) + \frac{\tau}{N'_T}\sum_{n=1}^{N'_T}\ell\big(g(x_n^{(T)}),y_n^{(T)}\big)$$

for different combination coefficients $\tau \in \{0.1, 0.3, 0.5, 0.9\}$ and then compute the discrepancy
$|\mathrm{E}_{\tau}f - \mathrm{E}^{(T)}_{N''_T}f|$ for each $N_S$. Since it has to be satisfied that $N_S \gg N'_T$, the initial $N_S$ is set to be
200. Each test is repeated 100 times and the final result is the average of the 100 results. After
each test, we increment $N_S$ by 200 until $N_S = 4000$. The experimental results are shown in Fig. 2.

Figure 2 illustrates that for any choice of $\tau \in \{0.1, 0.3, 0.5, 0.9\}$, the curve of $|\mathrm{E}_{\tau}f - \mathrm{E}^{(T)}_{N''_T}f|$ is
decreasing as $N_S$ increases. This is in accordance with our results on the asymptotic convergence
of the learning process for domain adaptation combining source and target data (see Theorems 6.1 and 6.3).
Furthermore, Fig. 2 also shows that when $\tau \approx N'_T/(N_S + N'_T)$, the discrepancy $|\mathrm{E}_{\tau}f - \mathrm{E}^{(T)}_{N''_T}f|$
has the fastest rate of convergence, and the rate becomes slower as $\tau$ moves further away from
$N'_T/(N_S + N'_T)$. Thus, this is in accordance with the theoretical analysis of the asymptotic
convergence presented above.



[Plot: one curve for each of $\tau = 0.1, 0.3, 0.5, 0.9$; horizontal axis $N_S$ from 500 to 4000.]

Figure 2: Domain Adaptation Combining Source and Target Data



7 Prior Works 

There have been some previous works on the theoretical analysis of domain adaptation with
multiple sources (see Ben-David et al., 2010; Crammer et al., 2006, 2008; Mansour et al., 2008, 2009a)
and domain adaptation combining source and target data (see Blitzer et al., 2008; Ben-David et al.,
2010).

In Crammer et al. (2006, 2008), the function class and the loss function are assumed to satisfy
the conditions of "$\alpha$-triangle inequality" and "uniform convergence bound". Moreover, one has
to get some prior information about the disparity between any source domain and the target
domain. Under these conditions, some generalization bounds were obtained by using the classical
techniques developed under the assumption of same distribution.

Mansour et al. (2008) proposed another framework to study the problem of domain adaptation
with multiple sources. In this framework, one has to know some prior knowledge including
the exact distributions of the source domains and the hypothesis function with a small loss on
each source domain. Furthermore, the target domain and the hypothesis function on the target
domain were deemed as the mixture of the source domains and the mixture of the hypothesis
functions on the source domains, respectively. Then, by introducing the Rényi divergence,
Mansour et al. (2009a) extended their previous work (Mansour et al., 2008) to a more general
setting, where the distribution of the target domain can be arbitrary and one only needs to know
an approximation of the exact distribution of each source domain. Ben-David et al. (2010) also
discussed the situation of domain adaptation with the mixture of source domains.

In Ben-David et al. (2010); Blitzer et al. (2008), domain adaptation combining source and
target data was originally proposed and meanwhile a theoretical framework was presented to
analyze its properties for the classification tasks by introducing the $\mathcal{H}$-divergence. Under the
condition of "$\lambda$-close", the authors applied the classical techniques developed under the
assumption of same distribution to achieve the generalization bounds based on the VC dimension.

Mansour et al. (2009b) introduced the discrepancy distance $\mathrm{disc}_{\ell}(\mathcal{D}^{(S)}, \mathcal{D}^{(T)})$ to capture the
difference between domains, and this quantity can be used in both classification and regression
tasks. By applying the classical results of statistical learning theory, the authors obtained the
generalization bounds based on the Rademacher complexity.

8 Conclusion 

In this paper, we propose a new framework to obtain generalization bounds of the learning
process for two representative types of domain adaptation: domain adaptation with multiple
sources and domain adaptation combining source and target data. This framework is suitable
for a variety of learning tasks including classification and regression. Based on the derived
bounds, we theoretically analyze the asymptotic convergence and the rate of convergence of the
learning process for domain adaptation. There are four important aspects of this framework:
the quantity measuring the difference between two domains, the complexity measure of the function
class, the deviation inequality and the symmetrization inequality for domain adaptation.

• We use the integral probability metric $D_{\mathcal{F}}(S,T)$ to measure the difference between two
domains $\mathcal{Z}^{(S)}$ and $\mathcal{Z}^{(T)}$. We show that the integral probability metric is well-defined and
is a (semi)metric on the space of probability distributions. It can be bounded by the
summation of the discrepancy distance $\mathrm{disc}_{\ell}(\mathcal{D}^{(S)}, \mathcal{D}^{(T)})$ and the quantity $Q_{\mathcal{G}}(g_*^{(S)}, g_*^{(T)})$,
which measure the difference between the input-space distributions $\mathcal{D}^{(S)}$ and $\mathcal{D}^{(T)}$ and the
difference between the labeling functions $g_*^{(S)}$ and $g_*^{(T)}$, respectively. Note that there is a special
case that is more suitable to the integral probability metric $D_{\mathcal{F}}(S,T)$ than other quantities
(see Remark 3.1).

• The uniform entropy number and the Rademacher complexity are adopted to achieve the
generalization bounds (25), (37) and (28), (38), respectively. It is noteworthy that the
generalization bounds (25) and (37) can lead to results based on the fat-shattering
dimension (see Mendelson, 2003, Theorem 2.18). According to Theorem 2.6.4
of Van der Vaart and Wellner (1996), bounds based on the VC dimension can also be
obtained from the results (25) and (37).

• Instead of directly applying the classical techniques, we present specific deviation
inequalities for the learning process of domain adaptation. In order to obtain the generalization
bounds based on the uniform entropy numbers, we develop specific Hoeffding-type
deviation inequalities for the two types of domain adaptation, respectively (see
Appendices A & B). Furthermore, we also generalize the classical McDiarmid's inequality to a
more general setting where the independent random variables can take values from different
domains (see Appendix C).

• We also develop the related symmetrization inequalities of the learning process for domain
adaptation. The derived inequalities incorporate the discrepancy term that is determined
by the difference between the source and the target domains and reflects the transfer of
learning from the source to the target domains.

Based on the derived generalization bounds, we provide a rigorous theoretical analysis of the
asymptotic convergence and the rate of convergence of the learning process for either kind of
domain adaptation. We also consider the choices of $w$ and $\tau$ that affect the rate of convergence
of the learning processes for the two types of domain adaptation, respectively. Moreover, we give
a comparison with the previous works Ben-David et al. (2010); Crammer et al. (2006, 2008);
Mansour et al. (2008, 2009a); Blitzer et al. (2008) as well as the related results of the learning
process under the assumption of same distribution (see Bousquet et al., 2004; Mendelson, 2003).
The numerical experiments support our theoretical findings as well.

In our future work, we will attempt to find a new distance between distributions to develop
the generalization bounds based on other complexity measures, and analyze other theoretical
properties of domain adaptation.
A Proof of Theorem 5.1

In this appendix, we provide the proof of Theorem 5.1. In order to achieve the proof, we need to
develop the specific Hoeffding-type deviation inequality and the symmetrization inequality for
domain adaptation with multiple sources.

A.1 Hoeffding-Type Deviation Inequality for Multiple Sources

Deviation (or concentration) inequalities play an essential role in obtaining the generalization
bounds for a certain learning process. Generally, specific deviation inequalities need to be
developed for different learning processes. There are many popular deviation and concentration
inequalities, e.g., Hoeffding's inequality, McDiarmid's inequality, Bennett's inequality, Bernstein's
inequality and Talagrand's inequality. These results are all built under the assumption of same
distribution, and thus they are not applicable (or at least cannot be directly applied) to the
setting of multiple sources. Next, based on Hoeffding's inequality (Hoeffding, 1963), we present
a deviation inequality for multiple sources.

Theorem A.1 Assume that $\mathcal{F}$ is a function class consisting of bounded functions with the range
$[a,b]$. Let $Z_1^{N_k} = \{z_n^{(k)}\}_{n=1}^{N_k}$ be the set of i.i.d. samples drawn from the source domain $\mathcal{Z}^{(S_k)} \subset \mathbb{R}^L$
($1 \le k \le K$). Given $w \in [0,1]^K$ with $\sum_{k=1}^{K} w_k = 1$ and for any $f \in \mathcal{F}$, we define a function
$F_w : \mathbb{R}^{L\sum_{k=1}^{K} N_k} \to \mathbb{R}$ as

$$F_w\big(\{X_1^{N_k}\}_{k=1}^{K}\big) := \sum_{k=1}^{K} w_k\Big(\prod_{i\neq k}N_i\Big)\sum_{n=1}^{N_k} f\big(x_n^{(k)}\big), \tag{43}$$

where for any $1 \le k \le K$ and given $N_k \in \mathbb{N}$, the set $X_1^{N_k}$ is denoted as

$$X_1^{N_k} := \{x_1^{(k)}, x_2^{(k)}, \dots, x_{N_k}^{(k)}\} \in (\mathbb{R}^L)^{N_k}.$$

Then, we have for any $\xi > 0$,

$$\Pr\Big\{\big|\mathrm{E}^{(S)}F_w - F_w\big(\{Z_1^{N_k}\}_{k=1}^{K}\big)\big| > \xi\Big\} \le 2\exp\left\{-\frac{2\xi^2}{(b-a)^2\big(\prod_{k=1}^{K}N_k\big)\big(\sum_{k=1}^{K}w_k^2\prod_{i\neq k}N_i\big)}\right\}, \tag{44}$$

where $\mathrm{E}^{(S)}$ stands for the expectation taken on all source domains $\{\mathcal{Z}^{(S_k)}\}_{k=1}^{K}$.

This result is an extension of the classical Hoeffding-type deviation inequality under the
assumption of same distribution (see Bousquet et al., 2004, Theorem 1). Compared to the classical
result, the resultant deviation inequality (44) is suitable for the setting of multiple sources. These
two inequalities coincide when there is only one source, i.e., $K = 1$.
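The inequality (44) can be sanity-checked by simulation. The sketch below is a hedged illustration, not part of the proof: it uses $K = 2$, the test function $f(z) = z$ with range $[0,1]$, and two hypothetical source distributions chosen only because their means are known in closed form.

```python
import math
import random

def f(z):
    """A bounded test function with range [0, 1]."""
    return z

def F_w(samples, weights):
    """F_w from (43): sum_k w_k * (prod_{i != k} N_i) * sum_n f(z_n^(k))."""
    total = math.prod(len(s) for s in samples)
    return sum(w * (total // len(s)) * sum(f(z) for z in s)
               for w, s in zip(weights, samples))

def hoeffding_type_bound(xi, weights, sizes, a=0.0, b=1.0):
    """Right-hand side of (44)."""
    total = math.prod(sizes)
    spread = sum(w * w * (total // n) for w, n in zip(weights, sizes))
    return 2.0 * math.exp(-2.0 * xi ** 2 / ((b - a) ** 2 * total * spread))

rng = random.Random(0)
sizes, weights, xi = [20, 30], [0.4, 0.6], 60.0
# Hypothetical sources on [0, 1]: U and U^2 with U uniform, with means 1/2 and 1/3.
draw = [lambda: rng.random(), lambda: rng.random() ** 2]
means = [0.5, 1.0 / 3.0]
expected = sum(w * math.prod(sizes) * m for w, m in zip(weights, means))

trials, hits = 2000, 0
for _ in range(trials):
    samples = [[d() for _ in range(n)] for d, n in zip(draw, sizes)]
    if abs(F_w(samples, weights) - expected) > xi:
        hits += 1

frequency = hits / trials
bound = hoeffding_type_bound(xi, weights, sizes)
print(f"empirical frequency = {frequency:.4f}, bound (44) = {bound:.4f}")
```

The empirical frequency of a large deviation should stay below the bound; as is typical for Hoeffding-type results, the bound is far from tight in this example.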

The proof of Theorem A.1 proceeds by a martingale method. Before the formal proof, we
introduce some essential notations.

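Let $\{Z_1^{N_k}\}_{k=1}^{K}$ be sample sets drawn from the multiple sources $\{\mathcal{Z}^{(S_k)}\}_{k=1}^{K}$, respectively. Define a
random variable

$$S_n^{(k)} := \mathrm{E}^{(S)}\Big\{F_w\big(\{Z_1^{N_k}\}_{k=1}^{K}\big)\,\Big|\,Z_1^{N_1}, Z_1^{N_2}, \dots, Z_1^{N_{k-1}}, Z_1^{n}\Big\}, \quad 1 \le k \le K,\ 0 \le n \le N_k, \tag{45}$$

where

$$Z_1^{n} = \{z_1^{(k)}, z_2^{(k)}, \dots, z_n^{(k)}\} \subseteq Z_1^{N_k}, \quad\text{and}\quad Z_1^{0} = \emptyset.$$

It is clear that

$$S_0^{(1)} = \mathrm{E}^{(S)}F_w \quad\text{and}\quad S_{N_K}^{(K)} = F_w\big(\{Z_1^{N_k}\}_{k=1}^{K}\big),$$

where $\mathrm{E}^{(S)}$ stands for the expectation taken on all source domains $\{\mathcal{Z}^{(S_k)}\}_{k=1}^{K}$.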
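Then, according to (43) and (45), we have for any $1 \le k \le K$ and $1 \le n \le N_k$,

$$\begin{aligned}
S_n^{(k)} - S_{n-1}^{(k)}
={}& \mathrm{E}^{(S)}\Big\{F_w\big(\{Z_1^{N_k}\}_{k=1}^{K}\big)\,\Big|\,Z_1^{N_1},\dots,Z_1^{N_{k-1}},Z_1^{n}\Big\}
 - \mathrm{E}^{(S)}\Big\{F_w\big(\{Z_1^{N_k}\}_{k=1}^{K}\big)\,\Big|\,Z_1^{N_1},\dots,Z_1^{N_{k-1}},Z_1^{n-1}\Big\} \\
={}& \sum_{l=1}^{k-1} w_l\Big(\prod_{i\neq l}N_i\Big)\sum_{j=1}^{N_l} f\big(z_j^{(l)}\big) + w_k\Big(\prod_{i\neq k}N_i\Big)\sum_{j=1}^{n} f\big(z_j^{(k)}\big) \\
&+ \mathrm{E}^{(S)}\Big\{\sum_{l=k+1}^{K} w_l\Big(\prod_{i\neq l}N_i\Big)\sum_{j=1}^{N_l} f\big(z_j^{(l)}\big) + w_k\Big(\prod_{i\neq k}N_i\Big)\sum_{j=n+1}^{N_k} f\big(z_j^{(k)}\big)\Big\} \\
&- \sum_{l=1}^{k-1} w_l\Big(\prod_{i\neq l}N_i\Big)\sum_{j=1}^{N_l} f\big(z_j^{(l)}\big) - w_k\Big(\prod_{i\neq k}N_i\Big)\sum_{j=1}^{n-1} f\big(z_j^{(k)}\big) \\
&- \mathrm{E}^{(S)}\Big\{\sum_{l=k+1}^{K} w_l\Big(\prod_{i\neq l}N_i\Big)\sum_{j=1}^{N_l} f\big(z_j^{(l)}\big) + w_k\Big(\prod_{i\neq k}N_i\Big)\sum_{j=n}^{N_k} f\big(z_j^{(k)}\big)\Big\} \\
={}& w_k\Big(\prod_{i\neq k}N_i\Big)\Big(f\big(z_n^{(k)}\big) - \mathrm{E}^{(S_k)}f\Big).
\end{aligned} \tag{46}$$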

To prove Theorem A.1, we need the following inequality resulting from Hoeffding's lemma.

Lemma A.1 Let $f$ be a function with the range $[a,b]$. Then, the following holds for any $\alpha > 0$:

$$\mathrm{E}^{(S)}\Big\{e^{\alpha\left(f(z^{(S)})-\mathrm{E}^{(S)}f\right)}\Big\} \le e^{\frac{\alpha^2(b-a)^2}{8}}. \tag{47}$$

Proof. We consider

$$\big(f(z^{(S)}) - \mathrm{E}^{(S)}f\big)$$

as a random variable. Then, it is clear that

$$\mathrm{E}\big\{f(z^{(S)}) - \mathrm{E}^{(S)}f\big\} = 0.$$

Since the value of $\mathrm{E}^{(S)}f$ is a constant, denoted as $e$, we have

$$a - e \le f(z^{(S)}) - \mathrm{E}^{(S)}f \le b - e.$$

According to Hoeffding's lemma, we then have

$$\mathrm{E}\Big\{e^{\alpha\left(f(z^{(S)})-\mathrm{E}^{(S)}f\right)}\Big\} \le e^{\frac{\alpha^2(b-a)^2}{8}}. \tag{48}$$

This completes the proof. ■
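As a quick numerical illustration of Lemma A.1, the bound (47) can be checked exactly for a two-point random variable, for which the moment generating function has a closed form. The helper names and the grid of test values below are hypothetical choices made only for this sketch.

```python
import math

def centered_mgf_two_point(alpha, a, b, p):
    """E exp(alpha * (X - E X)) for X = b with probability p and X = a otherwise."""
    mean = p * b + (1.0 - p) * a
    return (p * math.exp(alpha * (b - mean))
            + (1.0 - p) * math.exp(alpha * (a - mean)))

def hoeffding_lemma_bound(alpha, a, b):
    """Right-hand side of (47): exp(alpha^2 (b - a)^2 / 8)."""
    return math.exp(alpha ** 2 * (b - a) ** 2 / 8.0)

a, b = -1.0, 2.0
checks = [(alpha, p,
           centered_mgf_two_point(alpha, a, b, p),
           hoeffding_lemma_bound(alpha, a, b))
          for alpha in (0.1, 0.5, 1.0, 2.0)
          for p in (0.1, 0.5, 0.9)]

for alpha, p, mgf, bound in checks:
    print(f"alpha={alpha:.1f} p={p:.1f}  mgf={mgf:.4f}  bound={bound:.4f}")
```

Every row of the printed table should have the moment generating function no larger than the bound, which is what (47) asserts for any distribution supported on $[a,b]$.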

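We are now ready to prove Theorem A.1.

Proof of Theorem A.1. According to (43), (46), Lemma A.1, Markov's inequality, Jensen's
inequality and the law of iterated expectation, we have for any $\alpha > 0$,

$$\begin{aligned}
&\Pr\Big\{F_w\big(\{Z_1^{N_k}\}_{k=1}^{K}\big) - \mathrm{E}^{(S)}F_w > \xi\Big\} \\
&\le e^{-\alpha\xi}\,\mathrm{E}\Big\{e^{\alpha\left(F_w(\{Z_1^{N_k}\}_{k=1}^{K}) - \mathrm{E}^{(S)}F_w\right)}\Big\}
 = e^{-\alpha\xi}\,\mathrm{E}\Big\{e^{\alpha\sum_{k=1}^{K}\sum_{n=1}^{N_k}\left(S_n^{(k)}-S_{n-1}^{(k)}\right)}\Big\} \\
&= e^{-\alpha\xi}\,\mathrm{E}\Big\{\mathrm{E}\Big\{e^{\alpha\sum_{k=1}^{K}\sum_{n=1}^{N_k}\left(S_n^{(k)}-S_{n-1}^{(k)}\right)}\,\Big|\,Z_1^{N_1},\dots,Z_1^{N_{K-1}},Z_1^{N_K-1}\Big\}\Big\} \\
&= e^{-\alpha\xi}\,\mathrm{E}\Big\{e^{\alpha\left(\sum_{k=1}^{K}\sum_{n=1}^{N_k}\left(S_n^{(k)}-S_{n-1}^{(k)}\right)-\left(S_{N_K}^{(K)}-S_{N_K-1}^{(K)}\right)\right)}\,
 \mathrm{E}\Big\{e^{\alpha\left(S_{N_K}^{(K)}-S_{N_K-1}^{(K)}\right)}\,\Big|\,Z_1^{N_1},\dots,Z_1^{N_{K-1}},Z_1^{N_K-1}\Big\}\Big\} \\
&\le e^{-\alpha\xi}\,\mathrm{E}\Big\{e^{\alpha\left(\sum_{k=1}^{K}\sum_{n=1}^{N_k}\left(S_n^{(k)}-S_{n-1}^{(k)}\right)-\left(S_{N_K}^{(K)}-S_{N_K-1}^{(K)}\right)\right)}\Big\}\,
 e^{\frac{\alpha^2 w_K^2\left(\prod_{i\neq K}N_i\right)^2(b-a)^2}{8}},
\end{aligned} \tag{49}$$

where $Z_1^{N_K-1} := \{z_1^{(K)}, \dots, z_{N_K-1}^{(K)}\} \subset Z_1^{N_K}$. Therefore, applying the same step to each remaining
term, we have

$$\Pr\Big\{F_w\big(\{Z_1^{N_k}\}_{k=1}^{K}\big) - \mathrm{E}^{(S)}F_w > \xi\Big\} \le e^{\Phi(\alpha)-\alpha\xi}, \tag{50}$$

where

$$\Phi(\alpha) = \frac{\alpha^2(b-a)^2\big(\prod_{k=1}^{K}N_k\big)\big(\sum_{k=1}^{K}w_k^2\prod_{i\neq k}N_i\big)}{8}. \tag{51}$$

Similarly, we can obtain

$$\Pr\Big\{\mathrm{E}^{(S)}F_w - F_w\big(\{Z_1^{N_k}\}_{k=1}^{K}\big) > \xi\Big\} \le e^{\Phi(\alpha)-\alpha\xi}. \tag{52}$$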

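Note that $\Phi(\alpha) - \alpha\xi$ is a quadratic function with respect to $\alpha > 0$ and thus the minimum value
"$\min_{\alpha>0}\{\Phi(\alpha) - \alpha\xi\}$" is achieved when

$$\alpha = \frac{4\xi}{(b-a)^2\big(\prod_{k=1}^{K}N_k\big)\big(\sum_{k=1}^{K}w_k^2\prod_{i\neq k}N_i\big)}.$$

By combining (50), (51) and (52), we arrive at

$$\Pr\Big\{\big|\mathrm{E}^{(S)}F_w - F_w\big(\{Z_1^{N_k}\}_{k=1}^{K}\big)\big| > \xi\Big\} \le 2\exp\left\{-\frac{2\xi^2}{(b-a)^2\big(\prod_{k=1}^{K}N_k\big)\big(\sum_{k=1}^{K}w_k^2\prod_{i\neq k}N_i\big)}\right\}.$$

This completes the proof. ■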
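The minimization over $\alpha$ used at the end of the proof of Theorem A.1 can be written out explicitly. The shorthand $C$ below is introduced only for this worked step and does not appear in the original text.

```latex
% Shorthand (introduced here only for this worked step):
%   C := (b-a)^2 \Big(\prod_{k=1}^{K} N_k\Big)\Big(\sum_{k=1}^{K} w_k^2 \prod_{i\neq k} N_i\Big),
% so that \Phi(\alpha) = \alpha^2 C / 8.
\begin{align*}
\frac{d}{d\alpha}\Big(\frac{\alpha^2 C}{8} - \alpha\xi\Big)
  &= \frac{\alpha C}{4} - \xi = 0
  \quad\Longrightarrow\quad \alpha^{*} = \frac{4\xi}{C}, \\
\Phi(\alpha^{*}) - \alpha^{*}\xi
  &= \frac{16\xi^2}{8C} - \frac{4\xi^2}{C}
   = -\frac{2\xi^2}{C}.
\end{align*}
```

Substituting this value into the one-sided bounds and adding the two tails gives the factor 2 and the exponent $-2\xi^2/C$ of the final inequality.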

In the following subsection, we present a symmetrization inequality for domain adaptation
with multiple sources.

A.2 Symmetrization Inequality

Symmetrization inequalities are mainly used to replace the expected risk by an empirical risk
computed on another sample set that is independent of the given sample set but has the same
distribution. In this manner, the generalization bounds can be achieved by applying some kinds
of complexity measures, e.g., the covering number and the VC dimension. However, the classical
symmetrization results are built under the assumption of same distribution (see Bousquet et al.,
2004). The symmetrization inequality for domain adaptation with multiple sources is presented
in the following theorem:

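Theorem A.2 Assume that $\mathcal{F}$ is a function class with the range $[a,b]$. Let sample sets $\{Z_1^{N_k}\}_{k=1}^{K}$
and $\{Z'^{N_k}_1\}_{k=1}^{K}$ be drawn from the source domains $\{\mathcal{Z}^{(S_k)}\}_{k=1}^{K}$. Then, given an arbitrary
$\xi > D^{(w)}_{\mathcal{F}}(S,T)$ and $w = (w_1,\dots,w_K) \in [0,1]^K$ with $\sum_{k=1}^{K} w_k = 1$, we have for any
$\prod_{k=1}^{K} N_k \ge \frac{8(b-a)^2}{(\xi')^2}$,

$$\Pr\Big\{\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(T)}f-\mathrm{E}^{(S)}_{w}f\big| > \xi\Big\} \le 2\Pr\Big\{\sup_{f\in\mathcal{F}}\big|\mathrm{E}'^{(S)}_{w}f-\mathrm{E}^{(S)}_{w}f\big| > \frac{\xi'}{2}\Big\}, \tag{53}$$

where $\xi' = \xi - D^{(w)}_{\mathcal{F}}(S,T)$.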



This theorem shows that given $\xi > D^{(w)}_{\mathcal{F}}(S,T)$, the probability of the event

$$\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(T)}f-\mathrm{E}^{(S)}_{w}f\big| > \xi$$

can be bounded by using the probability of the event

$$\sup_{f\in\mathcal{F}}\big|\mathrm{E}'^{(S)}_{w}f-\mathrm{E}^{(S)}_{w}f\big| > \frac{\xi - D^{(w)}_{\mathcal{F}}(S,T)}{2}, \tag{54}$$

which is only determined by the characteristics of the source domains $\{\mathcal{Z}^{(S_k)}\}_{k=1}^{K}$ when
$\prod_{k=1}^{K} N_k \ge \frac{8(b-a)^2}{(\xi')^2}$ with $\xi' = \xi - D^{(w)}_{\mathcal{F}}(S,T)$. Compared to the classical symmetrization
result under the assumption of same distribution (see Bousquet et al., 2004), there is a discrepancy
term $D^{(w)}_{\mathcal{F}}(S,T)$ in the derived inequality. In particular, the two results coincide when every source
domain matches the target domain, i.e., $D_{\mathcal{F}}(S_k,T) = 0$ holds for any $1 \le k \le K$. The following is
the proof of Theorem A.2.

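Proof of Theorem A.2. Let $f$ be the function achieving the supremum

$$\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(T)}f - \mathrm{E}^{(S)}_{w}f\big|$$

with respect to the sample set $\{Z_1^{N_k}\}_{k=1}^{K}$. According to (jHJ), (EJ), (12) and (26), we arrive at

$$\big|\mathrm{E}^{(T)}f - \mathrm{E}^{(S)}_{w}f\big| = \big|\mathrm{E}^{(T)}f - \mathrm{E}^{(S)}f + \mathrm{E}^{(S)}f - \mathrm{E}^{(S)}_{w}f\big| \le D^{(w)}_{\mathcal{F}}(S,T) + \big|\mathrm{E}^{(S)}f - \mathrm{E}^{(S)}_{w}f\big|, \tag{55}$$

and thus,

$$\Pr\Big\{\big|\mathrm{E}^{(T)}f - \mathrm{E}^{(S)}_{w}f\big| > \xi\Big\} \le \Pr\Big\{D^{(w)}_{\mathcal{F}}(S,T) + \big|\mathrm{E}^{(S)}f - \mathrm{E}^{(S)}_{w}f\big| > \xi\Big\}, \tag{56}$$

where the expectation $\mathrm{E}^{(S)}f$ is defined as

$$\mathrm{E}^{(S)}f := \sum_{k=1}^{K} w_k\,\mathrm{E}^{(S_k)}f. \tag{57}$$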

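Let

$$\xi' := \xi - D^{(w)}_{\mathcal{F}}(S,T), \tag{58}$$

and denote $\wedge$ as the conjunction of two events. According to the triangle inequality, we have

$$\big|\mathrm{E}^{(S)}f - \mathrm{E}^{(S)}_{w}f\big| - \big|\mathrm{E}'^{(S)}_{w}f - \mathrm{E}^{(S)}f\big| \le \big|\mathrm{E}'^{(S)}_{w}f - \mathrm{E}^{(S)}_{w}f\big|,$$

and thus for any $\xi' > 0$,

$$\begin{aligned}
&\mathbf{1}_{\left\{|\mathrm{E}^{(S)}f-\mathrm{E}^{(S)}_{w}f|>\xi'\right\}}\,\mathbf{1}_{\left\{|\mathrm{E}^{(S)}f-\mathrm{E}'^{(S)}_{w}f|<\frac{\xi'}{2}\right\}} \\
&= \mathbf{1}_{\left\{|\mathrm{E}^{(S)}f-\mathrm{E}^{(S)}_{w}f|>\xi'\right\}\wedge\left\{|\mathrm{E}^{(S)}f-\mathrm{E}'^{(S)}_{w}f|<\frac{\xi'}{2}\right\}}
 \le \mathbf{1}_{\left\{|\mathrm{E}'^{(S)}_{w}f-\mathrm{E}^{(S)}_{w}f|>\frac{\xi'}{2}\right\}}.
\end{aligned} \tag{59}$$

Then, taking the expectation with respect to $\{Z'^{N_k}_1\}_{k=1}^{K}$ gives

$$\mathbf{1}_{\left\{|\mathrm{E}^{(S)}f-\mathrm{E}^{(S)}_{w}f|>\xi'\right\}}\,\Pr{}'\Big\{\big|\mathrm{E}^{(S)}f-\mathrm{E}'^{(S)}_{w}f\big| < \frac{\xi'}{2}\Big\}
 \le \Pr{}'\Big\{\big|\mathrm{E}'^{(S)}_{w}f-\mathrm{E}^{(S)}_{w}f\big| > \frac{\xi'}{2}\Big\}.$$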

By Chebyshev's inequality, since $\{Z'^{N_k}_1\}_{k=1}^{K}$ are sets of i.i.d. samples drawn from the
multiple sources $\{\mathcal{Z}^{(S_k)}\}_{k=1}^{K}$, respectively, we have for any $\xi' > 0$,

Pr'{|E (5) /-El S) /| > |} <Pr'|f:^El E(Sfc) /-/( z/ i fc) )l ^ f } 

=Pr' {l>(lH ^ |E(Sfe)/ - /V « fe))l ^ ^H 

L fc=l i^fc 71=1 

4E{EL^(n^^)ES 1 |E(^)/-/(zf))| 2 } 



(nf =1 ^ fc ) 2 (a 2 

4E{Ef = i^(n^ fc ^)iV fc (&-a) 2 } 

" (nf =1 ^) 2 (a 2 

<UliN k )(b-a) 2 _ 4 (6 -a) 2 



(nf =1 ^ fc ) 2 (e) 2 (a 2 (nti^) 
(60)



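Subsequently, according to (59) and (60), we have for any $\xi' > 0$,

$$\Pr{}'\Big\{\big|\mathrm{E}'^{(S)}_{w}f-\mathrm{E}^{(S)}_{w}f\big| > \frac{\xi'}{2}\Big\} \ge \mathbf{1}_{\left\{|\mathrm{E}^{(S)}f-\mathrm{E}^{(S)}_{w}f|>\xi'\right\}}\left(1-\frac{4(b-a)^2}{(\xi')^2\big(\prod_{k=1}^{K}N_k\big)}\right). \tag{61}$$

By combining (56), (58) and (61), taking the expectation with respect to $\{Z_1^{N_k}\}_{k=1}^{K}$ and letting

$$\frac{4(b-a)^2}{(\xi')^2\big(\prod_{k=1}^{K}N_k\big)} \le \frac{1}{2}$$

can lead to: for any $\xi > D^{(w)}_{\mathcal{F}}(S,T)$,

$$\Pr\Big\{\big|\mathrm{E}^{(T)}f-\mathrm{E}^{(S)}_{w}f\big| > \xi\Big\} \le \Pr\Big\{\big|\mathrm{E}^{(S)}f-\mathrm{E}^{(S)}_{w}f\big| > \xi'\Big\}
 \le 2\Pr\Big\{\big|\mathrm{E}'^{(S)}_{w}f-\mathrm{E}^{(S)}_{w}f\big| > \frac{\xi'}{2}\Big\} \tag{62}$$

with $\xi' = \xi - D^{(w)}_{\mathcal{F}}(S,T)$. This completes the proof. ■

By using the resultant deviation inequality and the symmetrization inequality, we can achieve
the proof of Theorem 5.1.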



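A.3 Proof of Theorem 5.1

Proof of Theorem 5.1. Consider $\epsilon$ as an independent Rademacher random variable, i.e., an
independent $\{-1,1\}$-valued random variable with equal probability of taking either value. Given
sample sets $\{Z_1^{2N_k}\}_{k=1}^{K}$, denote for any $f \in \mathcal{F}$ and $1 \le k \le K$,

$$\vec{\epsilon}^{\,(k)} := \big(\epsilon_1^{(k)},\dots,\epsilon_{N_k}^{(k)},-\epsilon_1^{(k)},\dots,-\epsilon_{N_k}^{(k)}\big) \in \{-1,1\}^{2N_k}, \tag{63}$$

and for any $f \in \mathcal{F}$,

$$\vec{f}\big(Z_1^{2N_k}\big) := \big(f(z'^{(k)}_1),\dots,f(z'^{(k)}_{N_k}),f(z_1^{(k)}),\dots,f(z_{N_k}^{(k)})\big). \tag{64}$$

According to (jSJ), © and Theorem A.2, given an arbitrary $\xi > D^{(w)}_{\mathcal{F}}(S,T)$, we have for any
$\{N_k\}_{k=1}^{K} \in \mathbb{N}^K$ such that $\prod_{k=1}^{K} N_k \ge \frac{8(b-a)^2}{(\xi')^2}$ with $\xi' = \xi - D^{(w)}_{\mathcal{F}}(S,T)$,

$$\begin{aligned}
&\Pr\Big\{\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(T)}f-\mathrm{E}^{(S)}_{w}f\big| > \xi\Big\} \\
&\le 2\Pr\Big\{\sup_{f\in\mathcal{F}}\big|\mathrm{E}'^{(S)}_{w}f-\mathrm{E}^{(S)}_{w}f\big| > \frac{\xi'}{2}\Big\} \qquad \text{(by Theorem A.2)} \\
&= 2\Pr\Big\{\sup_{f\in\mathcal{F}}\Big|\sum_{k=1}^{K}\frac{w_k}{N_k}\sum_{n=1}^{N_k}\big(f(z'^{(k)}_n)-f(z_n^{(k)})\big)\Big| > \frac{\xi'}{2}\Big\} \\
&= 2\Pr\Big\{\sup_{f\in\mathcal{F}}\Big|\sum_{k=1}^{K}\frac{w_k}{N_k}\sum_{n=1}^{N_k}\epsilon_n^{(k)}\big(f(z'^{(k)}_n)-f(z_n^{(k)})\big)\Big| > \frac{\xi'}{2}\Big\} \\
&= 2\Pr\Big\{\sup_{f\in\mathcal{F}}\Big|\sum_{k=1}^{K}\frac{w_k}{2N_k}\big\langle\vec{\epsilon}^{\,(k)},\vec{f}\big(Z_1^{2N_k}\big)\big\rangle\Big| > \frac{\xi'}{4}\Big\}. \qquad \text{(by (63) and (64))}
\end{aligned} \tag{65}$$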
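Fix a realization of $\{Z_1^{2N_k}\}_{k=1}^{K}$ and let $\Lambda$ be a $\xi'/8$-radius cover of $\mathcal{F}$ with respect to the
$\ell_1^{w}(\{Z_1^{2N_k}\}_{k=1}^{K})$ norm. Since $\mathcal{F}$ is composed of bounded functions with the range $[a,b]$, we
assume that the same holds for any $h \in \Lambda$. If $f_0$ is the function that achieves the following
supremum

$$\sup_{f\in\mathcal{F}}\Big|\sum_{k=1}^{K}\frac{w_k}{2N_k}\big\langle\vec{\epsilon}^{\,(k)},\vec{f}\big(Z_1^{2N_k}\big)\big\rangle\Big| > \frac{\xi'}{4},$$

there must be an $h_0 \in \Lambda$ that satisfies

$$\sum_{k=1}^{K}\frac{w_k}{2N_k}\sum_{n=1}^{N_k}\Big(\big|f_0(z'^{(k)}_n)-h_0(z'^{(k)}_n)\big| + \big|f_0(z_n^{(k)})-h_0(z_n^{(k)})\big|\Big) < \frac{\xi'}{8},$$

and meanwhile,

$$\Big|\sum_{k=1}^{K}\frac{w_k}{2N_k}\big\langle\vec{\epsilon}^{\,(k)},\vec{h}_0\big(Z_1^{2N_k}\big)\big\rangle\Big| > \frac{\xi'}{8}.$$

Therefore, for the realization of $\{Z_1^{2N_k}\}_{k=1}^{K}$, we arrive at

$$\Pr\Big\{\sup_{f\in\mathcal{F}}\Big|\sum_{k=1}^{K}\frac{w_k}{2N_k}\big\langle\vec{\epsilon}^{\,(k)},\vec{f}\big(Z_1^{2N_k}\big)\big\rangle\Big| > \frac{\xi'}{4}\Big\}
 \le \Pr\Big\{\sup_{h\in\Lambda}\Big|\sum_{k=1}^{K}\frac{w_k}{2N_k}\big\langle\vec{\epsilon}^{\,(k)},\vec{h}\big(Z_1^{2N_k}\big)\big\rangle\Big| > \frac{\xi'}{8}\Big\}. \tag{66}$$

Moreover, we denote the event

$$A := \Big\{\sup_{h\in\Lambda}\Big|\sum_{k=1}^{K}\frac{w_k}{2N_k}\big\langle\vec{\epsilon}^{\,(k)},\vec{h}\big(Z_1^{2N_k}\big)\big\rangle\Big| > \frac{\xi'}{8}\Big\},$$

and let $\mathbf{1}_A$ be the characteristic function of the event $A$. By Fubini's Theorem, we have

$$\Pr\{A\} = \mathrm{E}\Big\{\mathrm{E}_{\epsilon}\big\{\mathbf{1}_A\,\big|\,\{Z_1^{2N_k}\}_{k=1}^{K}\big\}\Big\}. \tag{67}$$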
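Fix a realization of $\{Z_1^{2N_k}\}_{k=1}^{K}$ again. According to (63), (64) and Theorem A.1, we have

$$\begin{aligned}
&\Pr\Big\{\sup_{h\in\Lambda}\Big|\sum_{k=1}^{K}\frac{w_k}{2N_k}\big\langle\vec{\epsilon}^{\,(k)},\vec{h}\big(Z_1^{2N_k}\big)\big\rangle\Big| > \frac{\xi'}{8}\Big\} \\
&\le |\Lambda|\max_{h\in\Lambda}\Pr\Big\{\Big|\sum_{k=1}^{K}\frac{w_k}{2N_k}\big\langle\vec{\epsilon}^{\,(k)},\vec{h}\big(Z_1^{2N_k}\big)\big\rangle\Big| > \frac{\xi'}{8}\Big\} \\
&= \mathcal{N}\big(\mathcal{F},\xi'/8,\ell_1^{w}(\{Z_1^{2N_k}\}_{k=1}^{K})\big)\max_{h\in\Lambda}\Pr\Big\{\big|\mathrm{E}'^{(S)}_{w}h-\mathrm{E}^{(S)}_{w}h\big| > \frac{\xi'}{4}\Big\} \\
&\le \mathcal{N}\big(\mathcal{F},\xi'/8,\ell_1^{w}(\{Z_1^{2N_k}\}_{k=1}^{K})\big)\max_{h\in\Lambda}\Pr\Big\{\big|\mathrm{E}^{(S)}h-\mathrm{E}'^{(S)}_{w}h\big| + \big|\mathrm{E}^{(S)}h-\mathrm{E}^{(S)}_{w}h\big| > \frac{\xi'}{4}\Big\} \\
&\le 2\,\mathcal{N}\big(\mathcal{F},\xi'/8,\ell_1^{w}(\{Z_1^{2N_k}\}_{k=1}^{K})\big)\max_{h\in\Lambda}\Pr\Big\{\big|\mathrm{E}^{(S)}h-\mathrm{E}^{(S)}_{w}h\big| > \frac{\xi'}{8}\Big\} \\
&\le 4\,\mathcal{N}\big(\mathcal{F},\xi'/8,\ell_1^{w}(\{Z_1^{2N_k}\}_{k=1}^{K})\big)\exp\left\{-\frac{\big(\prod_{k=1}^{K}N_k\big)\big(\xi-D^{(w)}_{\mathcal{F}}(S,T)\big)^2}{32(b-a)^2\big(\sum_{k=1}^{K}w_k^2\prod_{i\neq k}N_i\big)}\right\},
\end{aligned} \tag{68}$$

where the expectation $\mathrm{E}^{(S)}$ is defined in (57).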



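The combination of (65), (66) and (68) leads to the result: given an arbitrary $\xi > D^{(w)}_{\mathcal{F}}(S,T)$
and for any $\prod_{k=1}^{K} N_k \ge \frac{8(b-a)^2}{(\xi')^2}$ with $\xi' = \xi - D^{(w)}_{\mathcal{F}}(S,T)$,

$$\begin{aligned}
\Pr\Big\{\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(T)}f-\mathrm{E}^{(S)}_{w}f\big| > \xi\Big\}
&\le 8\,\mathrm{E}\,\mathcal{N}\big(\mathcal{F},\xi'/8,\ell_1^{w}(\{Z_1^{2N_k}\}_{k=1}^{K})\big)\exp\left\{-\frac{\big(\prod_{k=1}^{K}N_k\big)\big(\xi-D^{(w)}_{\mathcal{F}}(S,T)\big)^2}{32(b-a)^2\big(\sum_{k=1}^{K}w_k^2\prod_{i\neq k}N_i\big)}\right\} \\
&\le 8\,\mathcal{N}^{w}_1\Big(\mathcal{F},\xi'/8,2\sum_{k=1}^{K}N_k\Big)\exp\left\{-\frac{\big(\prod_{k=1}^{K}N_k\big)\big(\xi-D^{(w)}_{\mathcal{F}}(S,T)\big)^2}{32(b-a)^2\big(\sum_{k=1}^{K}w_k^2\prod_{i\neq k}N_i\big)}\right\}.
\end{aligned} \tag{69}$$

According to (69), letting

$$\epsilon := 8\,\mathcal{N}^{w}_1\Big(\mathcal{F},\xi'/8,2\sum_{k=1}^{K}N_k\Big)\exp\left\{-\frac{\big(\prod_{k=1}^{K}N_k\big)\big(\xi-D^{(w)}_{\mathcal{F}}(S,T)\big)^2}{32(b-a)^2\big(\sum_{k=1}^{K}w_k^2\prod_{i\neq k}N_i\big)}\right\},$$

we then arrive at with probability at least $1-\epsilon$,

$$\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(S)}_{w}f-\mathrm{E}^{(T)}f\big| \le D^{(w)}_{\mathcal{F}}(S,T) + \left(\frac{\ln \mathcal{N}^{w}_1\big(\mathcal{F},\xi'/8,2\sum_{k=1}^{K}N_k\big)-\ln(\epsilon/8)}{\dfrac{\prod_{k=1}^{K}N_k}{32(b-a)^2\big(\sum_{k=1}^{K}w_k^2\prod_{i\neq k}N_i\big)}}\right)^{\frac{1}{2}},$$

where $\xi' = \xi - D^{(w)}_{\mathcal{F}}(S,T)$. This completes the proof. ■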
B Proof of Theorem 6.1

Here, we provide the proof of Theorem 6.1. Similar to the situation of domain adaptation with
multiple sources, we need to develop the related Hoeffding-type deviation inequality and the
symmetrization inequality for domain adaptation combining source and target data.

B.1 Hoeffding-Type Deviation Inequality

Based on Hoeffding's inequality (Hoeffding, 1963), we derive a deviation inequality for the
combination of the source and the target domains.

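Theorem B.1 Assume that $\mathcal{F}$ is a function class consisting of bounded functions with the range
$[a,b]$. Let $Z_1^{N_S} := \{z_n^{(S)}\}_{n=1}^{N_S}$ and $Z_1^{N_T} := \{z_n^{(T)}\}_{n=1}^{N_T}$ be sets of i.i.d. samples drawn from the source
domain $\mathcal{Z}^{(S)} \subset \mathbb{R}^L$ and the target domain $\mathcal{Z}^{(T)} \subset \mathbb{R}^L$, respectively. For any $\tau \in [0,1)$, define a
function $F_{\tau} : \mathbb{R}^{L(N_S+N_T)} \to \mathbb{R}$ as

$$F_{\tau}\big(X_1^{N_T},Y_1^{N_S}\big) := \tau N_S\sum_{n=1}^{N_T} f(x_n) + (1-\tau)N_T\sum_{n=1}^{N_S} f(y_n), \tag{70}$$

where

$$X_1^{N_T} := \{x_1,\dots,x_{N_T}\} \in (\mathbb{R}^L)^{N_T}; \qquad Y_1^{N_S} := \{y_1,\dots,y_{N_S}\} \in (\mathbb{R}^L)^{N_S}.$$

Then, we have for any $\tau \in [0,1)$ and any $\xi > 0$,

$$\Pr\Big\{\big|F_{\tau}\big(Z_1^{N_T},Z_1^{N_S}\big)-\mathrm{E}^{(*)}F_{\tau}\big| > \xi\Big\} \le 2\exp\left\{-\frac{2\xi^2}{(b-a)^2N_SN_T\big((1-\tau)^2N_T+\tau^2N_S\big)}\right\}, \tag{71}$$

where the expectation $\mathrm{E}^{(*)}$ is taken on both the source domain $\mathcal{Z}^{(S)}$ and the target domain $\mathcal{Z}^{(T)}$.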

In this theorem, we present a deviation inequality for the combination of source and target
domains, which is an extension of the classical Hoeffding-type deviation inequality under the
assumption of same distribution (see Bousquet et al., 2004, Theorem 1). Compared to the classical
result, the resultant deviation inequality (71) allows the random variables to take values from
different domains. The two inequalities coincide when the source domain $\mathcal{Z}^{(S)}$ and the target
domain $\mathcal{Z}^{(T)}$ match, i.e., $D_{\mathcal{F}}(S,T) = 0$.
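The inequality (71) can be sanity-checked numerically. The sketch below is a minimal Monte Carlo illustration under assumptions that are not in the original text: the source domain is taken to be Uniform[0, 1], the target domain Beta(2, 2), and $f$ is the identity, so that $[a,b]=[0,1]$ and both domain means equal 0.5. The function names are hypothetical.

```python
import math
import random


def f_tau(target, source, tau, f=lambda z: z):
    """F_tau = tau*N_S*sum_target f + (1-tau)*N_T*sum_source f, following (70)."""
    n_t, n_s = len(target), len(source)
    return tau * n_s * sum(map(f, target)) + (1 - tau) * n_t * sum(map(f, source))


def hoeffding_bound(xi, n_s, n_t, tau, a=0.0, b=1.0):
    """Right-hand side of (71)."""
    c = (b - a) ** 2 * n_s * n_t * ((1 - tau) ** 2 * n_t + tau ** 2 * n_s)
    return 2.0 * math.exp(-2.0 * xi ** 2 / c)


def empirical_tail(xi, n_s, n_t, tau, trials=2000, seed=0):
    """Monte Carlo estimate of Pr{|F_tau - E F_tau| > xi} for the toy domains."""
    rng = random.Random(seed)
    # Toy assumption: both domains have mean 0.5, so E F_tau = 0.5 * N_S * N_T.
    mean = 0.5 * n_s * n_t
    hits = 0
    for _ in range(trials):
        src = [rng.random() for _ in range(n_s)]
        tgt = [rng.betavariate(2, 2) for _ in range(n_t)]
        hits += abs(f_tau(tgt, src, tau) - mean) > xi
    return hits / trials


if __name__ == "__main__":
    print(empirical_tail(60.0, 30, 20, 0.3), hoeffding_bound(60.0, 30, 20, 0.3))
```

The empirical frequency should stay below the bound; how loose the bound is depends on the toy distributions chosen here.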

The proof of Theorem B.1 also proceeds by a martingale method. Before the formal proof,
we introduce some essential notation.

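For any $\tau\in[0,1)$, we denote

$$F_S(\mathbf{Z}_1^{N_S}) := (1-\tau)N_T\sum_{n=1}^{N_S}f(\mathbf{z}_n^{(S)});\qquad F_T(\overline{\mathbf{Z}}_1^{N_T}) := \tau N_S\sum_{n=1}^{N_T}f(\mathbf{z}_n^{(T)}). \tag{72}$$

Recalling (70), it is evident that $F_\tau(\overline{\mathbf{Z}}_1^{N_T},\mathbf{Z}_1^{N_S}) = F_S(\mathbf{Z}_1^{N_S})+F_T(\overline{\mathbf{Z}}_1^{N_T})$. We then define two random variables:

$$\begin{aligned}
S_n &:= \mathrm{E}^{(S)}\big\{F_S(\mathbf{Z}_1^{N_S})\,\big|\,\mathbf{Z}_1^{n}\big\},\quad 0\le n\le N_S;\\
T_n &:= \mathrm{E}^{(T)}\big\{F_T(\overline{\mathbf{Z}}_1^{N_T})\,\big|\,\overline{\mathbf{Z}}_1^{n}\big\},\quad 0\le n\le N_T,
\end{aligned}\tag{73}$$

where

$$\begin{aligned}
\mathbf{Z}_1^{n} &= \{\mathbf{z}_1^{(S)},\dots,\mathbf{z}_n^{(S)}\}\subseteq\mathbf{Z}_1^{N_S}\ \text{ with } \mathbf{Z}_1^{0}=\varnothing;\\
\overline{\mathbf{Z}}_1^{n} &= \{\mathbf{z}_1^{(T)},\dots,\mathbf{z}_n^{(T)}\}\subseteq\overline{\mathbf{Z}}_1^{N_T}\ \text{ with } \overline{\mathbf{Z}}_1^{0}=\varnothing.
\end{aligned}$$

It is clear that $S_0 = \mathrm{E}^{(S)}F_S$; $S_{N_S} = F_S(\mathbf{Z}_1^{N_S})$ and $T_0 = \mathrm{E}^{(T)}F_T$; $T_{N_T} = F_T(\overline{\mathbf{Z}}_1^{N_T})$.
According to (70) and (73), we have for any $1\le n\le N_S$ and any $\tau\in[0,1)$,

$$\begin{aligned}
S_n-S_{n-1}
&= \mathrm{E}^{(S)}\big\{F_S(\mathbf{Z}_1^{N_S})\,\big|\,\mathbf{Z}_1^{n}\big\}-\mathrm{E}^{(S)}\big\{F_S(\mathbf{Z}_1^{N_S})\,\big|\,\mathbf{Z}_1^{n-1}\big\}\\
&= \mathrm{E}^{(S)}\Big\{(1-\tau)N_T\sum_{m=1}^{N_S}f(\mathbf{z}_m^{(S)})\,\Big|\,\mathbf{Z}_1^{n}\Big\}-\mathrm{E}^{(S)}\Big\{(1-\tau)N_T\sum_{m=1}^{N_S}f(\mathbf{z}_m^{(S)})\,\Big|\,\mathbf{Z}_1^{n-1}\Big\}\\
&= (1-\tau)N_T\sum_{m=1}^{n}f(\mathbf{z}_m^{(S)})+\mathrm{E}^{(S)}\Big\{(1-\tau)N_T\sum_{m=n+1}^{N_S}f(\mathbf{z}_m^{(S)})\Big\}\\
&\quad-\Big((1-\tau)N_T\sum_{m=1}^{n-1}f(\mathbf{z}_m^{(S)})+\mathrm{E}^{(S)}\Big\{(1-\tau)N_T\sum_{m=n}^{N_S}f(\mathbf{z}_m^{(S)})\Big\}\Big)\\
&= (1-\tau)N_T\big(f(\mathbf{z}_n^{(S)})-\mathrm{E}^{(S)}f\big).
\end{aligned}\tag{74}$$

Similarly, we also have for any $1\le n\le N_T$,

$$T_n-T_{n-1} = \tau N_S\big(f(\mathbf{z}_n^{(T)})-\mathrm{E}^{(T)}f\big). \tag{75}$$

We are now ready to prove Theorem B.1.

Proof of Theorem B.1. According to (70) and (72), we have

$$\begin{aligned}
F_\tau(\overline{\mathbf{Z}}_1^{N_T},\mathbf{Z}_1^{N_S})-\mathrm{E}^{(*)}F_\tau
&= F_S(\mathbf{Z}_1^{N_S})+F_T(\overline{\mathbf{Z}}_1^{N_T})-\mathrm{E}^{(*)}\{F_S+F_T\}\\
&= F_S(\mathbf{Z}_1^{N_S})-\mathrm{E}^{(S)}F_S+F_T(\overline{\mathbf{Z}}_1^{N_T})-\mathrm{E}^{(T)}F_T.
\end{aligned}\tag{76}$$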
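According to Lemma A.1, (74), (75), (76), Markov's inequality, Jensen's inequality and the
law of iterated expectation, we have for any $\alpha>0$ and any $\tau\in[0,1)$,

$$\begin{aligned}
&\Pr\big\{F_\tau(\overline{\mathbf{Z}}_1^{N_T},\mathbf{Z}_1^{N_S})-\mathrm{E}^{(*)}F_\tau > \xi\big\}\\
&= \Pr\big\{F_S(\mathbf{Z}_1^{N_S})-\mathrm{E}^{(S)}F_S+F_T(\overline{\mathbf{Z}}_1^{N_T})-\mathrm{E}^{(T)}F_T > \xi\big\}\\
&\le e^{-\alpha\xi}\,\mathrm{E}\Big\{e^{\alpha\left(F_S(\mathbf{Z}_1^{N_S})-\mathrm{E}^{(S)}F_S+F_T(\overline{\mathbf{Z}}_1^{N_T})-\mathrm{E}^{(T)}F_T\right)}\Big\}\\
&= e^{-\alpha\xi}\,\mathrm{E}\Big\{e^{\alpha\left(\sum_{n=1}^{N_S}(S_n-S_{n-1})+\sum_{n=1}^{N_T}(T_n-T_{n-1})\right)}\Big\}\\
&= e^{-\alpha\xi}\,\mathrm{E}\Big\{e^{\alpha\left(\sum_{n=1}^{N_S-1}(S_n-S_{n-1})+\sum_{n=1}^{N_T-1}(T_n-T_{n-1})\right)}\,\mathrm{E}\big\{e^{\alpha(T_{N_T}-T_{N_T-1})}\,\big|\,\overline{\mathbf{Z}}_1^{N_T-1}\big\}\,\mathrm{E}\big\{e^{\alpha(S_{N_S}-S_{N_S-1})}\,\big|\,\mathbf{Z}_1^{N_S-1}\big\}\Big\}\\
&\le e^{-\alpha\xi}\,\mathrm{E}\Big\{e^{\alpha\left(\sum_{n=1}^{N_S-1}(S_n-S_{n-1})+\sum_{n=1}^{N_T-1}(T_n-T_{n-1})\right)}\Big\}\,e^{\frac{\tau^2N_S^2\alpha^2(b-a)^2}{8}}\,e^{\frac{(1-\tau)^2N_T^2\alpha^2(b-a)^2}{8}}.
\end{aligned}\tag{77}$$

Then, we have

$$\Pr\big\{F_\tau(\overline{\mathbf{Z}}_1^{N_T},\mathbf{Z}_1^{N_S})-\mathrm{E}^{(*)}F_\tau > \xi\big\} \le e^{\Phi(\alpha)-\alpha\xi}, \tag{78}$$

where

$$\Phi(\alpha) := \frac{\alpha^2(1-\tau)^2(b-a)^2N_SN_T^2}{8}+\frac{\alpha^2\tau^2(b-a)^2N_S^2N_T}{8}. \tag{79}$$

Similarly, we can arrive at

$$\Pr\big\{\mathrm{E}^{(*)}F_\tau-F_\tau(\overline{\mathbf{Z}}_1^{N_T},\mathbf{Z}_1^{N_S}) > \xi\big\} \le e^{\Phi(\alpha)-\alpha\xi}. \tag{80}$$

Note that $\Phi(\alpha)-\alpha\xi$ is a quadratic function with respect to $\alpha>0$ and thus the minimum value $\min_{\alpha>0}\{\Phi(\alpha)-\alpha\xi\}$ is achieved when

$$\alpha = \frac{4\xi}{(b-a)^2N_SN_T\big((1-\tau)^2N_T+\tau^2N_S\big)}.$$

By combining (78), (79) and (80), we arrive at

$$\Pr\Big\{\big|F_\tau(\overline{\mathbf{Z}}_1^{N_T},\mathbf{Z}_1^{N_S})-\mathrm{E}^{(*)}F_\tau\big| > \xi\Big\} \le 2\exp\left\{-\frac{2\xi^2}{(b-a)^2N_SN_T\big((1-\tau)^2N_T+\tau^2N_S\big)}\right\}.$$

This completes the proof.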
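The last optimization step of the proof can be verified numerically. The sketch below is a minimal check with hypothetical function names: it evaluates $\Phi(\alpha)-\alpha\xi$ at the stated minimizer and compares the value with the exponent $-2\xi^2/\big((b-a)^2N_SN_T((1-\tau)^2N_T+\tau^2N_S)\big)$ of the final bound.

```python
def phi(alpha, n_s, n_t, tau, a=0.0, b=1.0):
    """Phi(alpha): the quadratic term accumulated over all martingale increments."""
    r2 = (b - a) ** 2
    return (alpha ** 2 * (1 - tau) ** 2 * r2 * n_s * n_t ** 2
            + alpha ** 2 * tau ** 2 * r2 * n_s ** 2 * n_t) / 8.0


def scale(n_s, n_t, tau, a=0.0, b=1.0):
    """(b-a)^2 * N_S * N_T * ((1-tau)^2 N_T + tau^2 N_S)."""
    return (b - a) ** 2 * n_s * n_t * ((1 - tau) ** 2 * n_t + tau ** 2 * n_s)


def optimal_alpha(xi, n_s, n_t, tau, a=0.0, b=1.0):
    """Minimizer of Phi(alpha) - alpha*xi over alpha > 0."""
    return 4.0 * xi / scale(n_s, n_t, tau, a, b)


def exponent_at(alpha, xi, n_s, n_t, tau, a=0.0, b=1.0):
    """Value of Phi(alpha) - alpha*xi."""
    return phi(alpha, n_s, n_t, tau, a, b) - alpha * xi


if __name__ == "__main__":
    xi, n_s, n_t, tau = 60.0, 30, 20, 0.3
    best = optimal_alpha(xi, n_s, n_t, tau)
    print(exponent_at(best, xi, n_s, n_t, tau), -2.0 * xi ** 2 / scale(n_s, n_t, tau))
```

Since $\Phi(\alpha)=\alpha^2c/8$ with $c$ the value returned by `scale`, the minimum is $-2\xi^2/c$, matching the exponent in the theorem.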
B.2 Symmetrization Inequality

In the following theorem, we present the symmetrization inequality for domain adaptation
combining source and target data.

Theorem B.2 Assume that $\mathcal{F}$ is a function class with the range $[a,b]$. Let $\mathbf{Z}_1^{N_S}$ and $\mathbf{Z}'^{N_S}_1$ be drawn from the source domain $\mathcal{Z}^{(S)}$, and $\overline{\mathbf{Z}}_1^{N_T}$ and $\overline{\mathbf{Z}}'^{N_T}_1$ be drawn from the target domain $\mathcal{Z}^{(T)}$. Then, for any $\tau\in[0,1)$ and given an arbitrary $\xi>(1-\tau)D_{\mathcal{F}}(S,T)$, we have for any $N_SN_T\ge\frac{8(b-a)^2}{(\xi')^2}$,

$$\Pr\Big\{\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(T)}f-\mathrm{E}_\tau f\big| > \xi\Big\} \le 2\Pr\Big\{\sup_{f\in\mathcal{F}}\big|\mathrm{E}'_\tau f-\mathrm{E}_\tau f\big| > \frac{\xi'}{2}\Big\} \tag{81}$$

with $\xi' = \xi-(1-\tau)D_{\mathcal{F}}(S,T)$.

This theorem shows that for any $\xi>(1-\tau)D_{\mathcal{F}}(S,T)$, the probability of the event

$$\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(T)}f-\mathrm{E}_\tau f\big| > \xi$$

can be bounded by using the probability of the event

$$\sup_{f\in\mathcal{F}}\big|\mathrm{E}'_\tau f-\mathrm{E}_\tau f\big| > \frac{\xi'}{2},$$

which is only determined by the samples drawn from the source domain $\mathcal{Z}^{(S)}$ and the target
domain $\mathcal{Z}^{(T)}$, when $N_SN_T\ge\frac{8(b-a)^2}{(\xi')^2}$. Compared to the classical symmetrization result under
the assumption of same distribution (see Bousquet et al., 2004), there is a discrepancy term
$(1-\tau)D_{\mathcal{F}}(S,T)$. The two results will coincide when the source and the target domains match,
i.e., $D_{\mathcal{F}}(S,T) = 0$. The following is the proof of Theorem B.2.

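Proof of Theorem B.2. Let $\hat f$ be the function achieving the supremum

$$\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(T)}f-\mathrm{E}_\tau f\big|$$

with respect to $\mathbf{Z}_1^{N_S}$ and $\overline{\mathbf{Z}}_1^{N_T}$. According to (9) and (12), we arrive at

$$\begin{aligned}
\big|\mathrm{E}^{(T)}\hat f-\mathrm{E}_\tau\hat f\big|
&= \big|\tau\mathrm{E}^{(T)}\hat f+(1-\tau)\mathrm{E}^{(T)}\hat f-(1-\tau)\mathrm{E}^{(S)}\hat f+(1-\tau)\mathrm{E}^{(S)}\hat f-\mathrm{E}_\tau\hat f\big|\\
&= \big|\tau(\mathrm{E}^{(T)}\hat f-\mathrm{E}^{(T)}_{N_T}\hat f)+(1-\tau)(\mathrm{E}^{(T)}\hat f-\mathrm{E}^{(S)}\hat f)+(1-\tau)(\mathrm{E}^{(S)}\hat f-\mathrm{E}^{(S)}_{N_S}\hat f)\big|\\
&\le (1-\tau)D_{\mathcal{F}}(S,T)+\big|\tau(\mathrm{E}^{(T)}\hat f-\mathrm{E}^{(T)}_{N_T}\hat f)+(1-\tau)(\mathrm{E}^{(S)}\hat f-\mathrm{E}^{(S)}_{N_S}\hat f)\big|,
\end{aligned}\tag{82}$$

and thus

$$\Pr\big\{\big|\mathrm{E}^{(T)}\hat f-\mathrm{E}_\tau\hat f\big| > \xi\big\} \le \Pr\Big\{(1-\tau)D_{\mathcal{F}}(S,T)+\big|\tau(\mathrm{E}^{(T)}\hat f-\mathrm{E}^{(T)}_{N_T}\hat f)+(1-\tau)(\mathrm{E}^{(S)}\hat f-\mathrm{E}^{(S)}_{N_S}\hat f)\big| > \xi\Big\}, \tag{83}$$

where

$$\mathrm{E}^{(T)}_{N_T}\hat f := \frac{1}{N_T}\sum_{n=1}^{N_T}\hat f(\mathbf{z}_n^{(T)});\qquad \mathrm{E}^{(S)}_{N_S}\hat f := \frac{1}{N_S}\sum_{n=1}^{N_S}\hat f(\mathbf{z}_n^{(S)}). \tag{84}$$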



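Let

$$\xi' := \xi-(1-\tau)D_{\mathcal{F}}(S,T) \tag{85}$$

and denote $\wedge$ as the conjunction of two events. According to the triangle inequality, we have

$$\begin{aligned}
&\mathbf{1}_{\left\{\left|\tau(\mathrm{E}^{(T)}\hat f-\mathrm{E}^{(T)}_{N_T}\hat f)+(1-\tau)(\mathrm{E}^{(S)}\hat f-\mathrm{E}^{(S)}_{N_S}\hat f)\right| > \xi'\right\}}\;\mathbf{1}_{\left\{\left|\tau(\mathrm{E}^{(T)}\hat f-\mathrm{E}'^{(T)}_{N_T}\hat f)+(1-\tau)(\mathrm{E}^{(S)}\hat f-\mathrm{E}'^{(S)}_{N_S}\hat f)\right| < \xi'/2\right\}}\\
&= \mathbf{1}_{\left\{\left|\tau(\mathrm{E}^{(T)}\hat f-\mathrm{E}^{(T)}_{N_T}\hat f)+(1-\tau)(\mathrm{E}^{(S)}\hat f-\mathrm{E}^{(S)}_{N_S}\hat f)\right| > \xi'\ \wedge\ \left|\tau(\mathrm{E}^{(T)}\hat f-\mathrm{E}'^{(T)}_{N_T}\hat f)+(1-\tau)(\mathrm{E}^{(S)}\hat f-\mathrm{E}'^{(S)}_{N_S}\hat f)\right| < \xi'/2\right\}}\\
&\le \mathbf{1}_{\left\{\left|\tau(\mathrm{E}'^{(T)}_{N_T}\hat f-\mathrm{E}^{(T)}_{N_T}\hat f)+(1-\tau)(\mathrm{E}'^{(S)}_{N_S}\hat f-\mathrm{E}^{(S)}_{N_S}\hat f)\right| > \xi'/2\right\}}.
\end{aligned}$$

Then, taking the expectation with respect to $\mathbf{Z}'^{N_S}_1$ and $\overline{\mathbf{Z}}'^{N_T}_1$ gives

$$\begin{aligned}
&\mathbf{1}_{\left\{\left|\tau(\mathrm{E}^{(T)}\hat f-\mathrm{E}^{(T)}_{N_T}\hat f)+(1-\tau)(\mathrm{E}^{(S)}\hat f-\mathrm{E}^{(S)}_{N_S}\hat f)\right| > \xi'\right\}}\times\Pr{}'\Big\{\big|\tau(\mathrm{E}^{(T)}\hat f-\mathrm{E}'^{(T)}_{N_T}\hat f)+(1-\tau)(\mathrm{E}^{(S)}\hat f-\mathrm{E}'^{(S)}_{N_S}\hat f)\big| < \frac{\xi'}{2}\Big\}\\
&\le \Pr{}'\Big\{\big|\tau(\mathrm{E}'^{(T)}_{N_T}\hat f-\mathrm{E}^{(T)}_{N_T}\hat f)+(1-\tau)(\mathrm{E}'^{(S)}_{N_S}\hat f-\mathrm{E}^{(S)}_{N_S}\hat f)\big| > \frac{\xi'}{2}\Big\}.
\end{aligned}\tag{86}$$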

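By Chebyshev's inequality, since $\mathbf{Z}'^{N_S}_1 = \{\mathbf{z}'^{(S)}_n\}_{n=1}^{N_S}$ and $\overline{\mathbf{Z}}'^{N_T}_1 = \{\mathbf{z}'^{(T)}_n\}_{n=1}^{N_T}$ are sets of i.i.d.
samples drawn from the source domain $\mathcal{Z}^{(S)}$ and the target domain $\mathcal{Z}^{(T)}$ respectively, we have
for any $\xi'>0$ and any $\tau\in[0,1)$,

$$\begin{aligned}
&\Pr{}'\Big\{\big|\tau(\mathrm{E}^{(T)}\hat f-\mathrm{E}'^{(T)}_{N_T}\hat f)+(1-\tau)(\mathrm{E}^{(S)}\hat f-\mathrm{E}'^{(S)}_{N_S}\hat f)\big| \ge \frac{\xi'}{2}\Big\}\\
&\le \Pr{}'\Big\{\frac{\tau}{N_T}\sum_{n=1}^{N_T}\big|\mathrm{E}^{(T)}\hat f-\hat f(\mathbf{z}'^{(T)}_n)\big|+\frac{1-\tau}{N_S}\sum_{n=1}^{N_S}\big|\mathrm{E}^{(S)}\hat f-\hat f(\mathbf{z}'^{(S)}_n)\big| \ge \frac{\xi'}{2}\Big\}\\
&\le \frac{4\,\mathrm{E}\big\{\tau N_SN_T(\mathrm{E}^{(T)}\hat f-\hat f(\mathbf{z}'^{(T)}))^2+(1-\tau)N_SN_T(\mathrm{E}^{(S)}\hat f-\hat f(\mathbf{z}'^{(S)}))^2\big\}}{N_S^2N_T^2(\xi')^2}\\
&\le \frac{4\,\mathrm{E}\big\{\tau N_SN_T(b-a)^2+(1-\tau)N_SN_T(b-a)^2\big\}}{N_S^2N_T^2(\xi')^2}\\
&= \frac{4(b-a)^2}{N_SN_T(\xi')^2},
\end{aligned}\tag{87}$$

where $\mathbf{z}'^{(S)}$ and $\mathbf{z}'^{(T)}$ stand for the ghost random variables taking values from the source domain
$\mathcal{Z}^{(S)}$ and the target domain $\mathcal{Z}^{(T)}$, respectively.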



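Subsequently, according to (86) and (87), we have for any $\xi'>0$,

$$\begin{aligned}
&\Pr{}'\Big\{\big|\tau(\mathrm{E}'^{(T)}_{N_T}\hat f-\mathrm{E}^{(T)}_{N_T}\hat f)+(1-\tau)(\mathrm{E}'^{(S)}_{N_S}\hat f-\mathrm{E}^{(S)}_{N_S}\hat f)\big| > \frac{\xi'}{2}\Big\}\\
&\ge \mathbf{1}_{\left\{\left|\tau(\mathrm{E}^{(T)}\hat f-\mathrm{E}^{(T)}_{N_T}\hat f)+(1-\tau)(\mathrm{E}^{(S)}\hat f-\mathrm{E}^{(S)}_{N_S}\hat f)\right| > \xi'\right\}}\Big(1-\frac{4(b-a)^2}{N_SN_T(\xi')^2}\Big).
\end{aligned}\tag{88}$$

According to (83), (85) and (88), by letting

$$\frac{4(b-a)^2}{N_SN_T(\xi')^2} \le \frac{1}{2},$$

and taking the expectation with respect to $\mathbf{Z}_1^{N_S}$ and $\overline{\mathbf{Z}}_1^{N_T}$, we have for any $\xi'>0$,

$$\begin{aligned}
\Pr\big\{\big|\mathrm{E}^{(T)}\hat f-\mathrm{E}_\tau\hat f\big| > \xi\big\}
&\le \Pr\Big\{\big|\tau(\mathrm{E}^{(T)}\hat f-\mathrm{E}^{(T)}_{N_T}\hat f)+(1-\tau)(\mathrm{E}^{(S)}\hat f-\mathrm{E}^{(S)}_{N_S}\hat f)\big| > \xi'\Big\}\\
&\le 2\Pr\Big\{\big|\tau(\mathrm{E}'^{(T)}_{N_T}\hat f-\mathrm{E}^{(T)}_{N_T}\hat f)+(1-\tau)(\mathrm{E}'^{(S)}_{N_S}\hat f-\mathrm{E}^{(S)}_{N_S}\hat f)\big| > \frac{\xi'}{2}\Big\}\\
&= 2\Pr\Big\{\big|\mathrm{E}'_\tau\hat f-\mathrm{E}_\tau\hat f\big| > \frac{\xi'}{2}\Big\}
\end{aligned}\tag{89}$$

with $\xi' = \xi-(1-\tau)D_{\mathcal{F}}(S,T)$. This completes the proof. ■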
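The sample-size condition of Theorem B.2 comes directly from requiring the Chebyshev term in (87) to be at most 1/2. A minimal sketch of that arithmetic follows; the function names are hypothetical and the numeric inputs are illustrative only.

```python
import math


def xi_prime(xi, tau, discrepancy):
    """xi' = xi - (1 - tau) * D_F(S, T), as in (85)."""
    return xi - (1 - tau) * discrepancy


def chebyshev_term(n_s, n_t, xp, a=0.0, b=1.0):
    """4 (b-a)^2 / (N_S N_T xi'^2), the right-hand side of (87)."""
    return 4.0 * (b - a) ** 2 / (n_s * n_t * xp ** 2)


def min_sample_product(xi, tau, discrepancy, a=0.0, b=1.0):
    """Smallest integer value of N_S * N_T with N_S N_T >= 8 (b-a)^2 / xi'^2."""
    xp = xi_prime(xi, tau, discrepancy)
    if xp <= 0:
        raise ValueError("the theorem requires xi > (1 - tau) * D_F(S, T)")
    return math.ceil(8.0 * (b - a) ** 2 / xp ** 2)


if __name__ == "__main__":
    # Illustrative values: xi = 0.75, tau = 0.5, D_F(S, T) = 0.5, so xi' = 0.5.
    print(min_sample_product(0.75, 0.5, 0.5))
    print(chebyshev_term(4, 8, 0.5))
```

When $\xi$ approaches $(1-\tau)D_{\mathcal{F}}(S,T)$ from above, $\xi'$ tends to zero and the required product $N_SN_T$ grows without bound.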

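We are now ready to prove Theorem 6.1.

B.3 Proof of Theorem 6.1

Proof of Theorem 6.1. Consider $\{\epsilon_n\}_{n=1}^{N}$ as independent Rademacher random variables, i.e.,
independent $\{\pm1\}$-valued random variables with equal probability of taking either value. Given
$\{\epsilon_n\}_{n=1}^{N_S}$, $\{\epsilon_n\}_{n=1}^{N_T}$, $\mathbf{Z}_1^{2N_S}$ and $\overline{\mathbf{Z}}_1^{2N_T}$, denote

$$\begin{aligned}
\vec{\epsilon}_S &:= (\epsilon_1,\dots,\epsilon_{N_S},-\epsilon_1,\dots,-\epsilon_{N_S})\in\{\pm1\}^{2N_S};\\
\vec{\epsilon}_T &:= (\epsilon_1,\dots,\epsilon_{N_T},-\epsilon_1,\dots,-\epsilon_{N_T})\in\{\pm1\}^{2N_T},
\end{aligned}\tag{90}$$

and for any $f\in\mathcal{F}$,

$$\begin{aligned}
\vec{f}(\mathbf{Z}_1^{2N_S}) &:= \big(f(\mathbf{z}'^{(S)}_1),\dots,f(\mathbf{z}'^{(S)}_{N_S}),f(\mathbf{z}^{(S)}_1),\dots,f(\mathbf{z}^{(S)}_{N_S})\big)\in[a,b]^{2N_S};\\
\vec{f}(\overline{\mathbf{Z}}_1^{2N_T}) &:= \big(f(\mathbf{z}'^{(T)}_1),\dots,f(\mathbf{z}'^{(T)}_{N_T}),f(\mathbf{z}^{(T)}_1),\dots,f(\mathbf{z}^{(T)}_{N_T})\big)\in[a,b]^{2N_T}.
\end{aligned}\tag{91}$$

We also denote

$$\begin{aligned}
\mathbf{Z} &:= \{\overline{\mathbf{Z}}_1^{2N_T},\mathbf{Z}_1^{2N_S}\}\in(\mathcal{Z}^{(T)})^{2N_T}\times(\mathcal{Z}^{(S)})^{2N_S};\\
\vec{f}(\mathbf{Z}) &:= \big(\underbrace{\vec{f}(\overline{\mathbf{Z}}_1^{2N_T}),\dots,\vec{f}(\overline{\mathbf{Z}}_1^{2N_T})}_{N_S},\underbrace{\vec{f}(\mathbf{Z}_1^{2N_S}),\dots,\vec{f}(\mathbf{Z}_1^{2N_S})}_{N_T}\big)\in[a,b]^{4N_SN_T}.
\end{aligned}\tag{92}$$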
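The stacked vectors in (90) and (91) encode a signed difference between ghost and original samples: for one domain with $N$ points, $\frac{1}{2N}\langle\vec{\epsilon},\vec{f}\rangle = \frac{1}{2}\cdot\frac{1}{N}\sum_{n}\epsilon_n\big(f(\mathbf{z}'_n)-f(\mathbf{z}_n)\big)$. The following minimal sketch, with hypothetical function names and made-up function values, checks this identity, which explains why a threshold on the signed difference is halved when the inner-product form is used.

```python
def symmetrized_inner_product(eps, f_ghost, f_orig):
    """(1/2N) <eps_vec, f_vec> with eps_vec = (eps, -eps) and f_vec = (f(z'), f(z))."""
    n = len(eps)
    eps_vec = list(eps) + [-e for e in eps]
    f_vec = list(f_ghost) + list(f_orig)
    return sum(e * v for e, v in zip(eps_vec, f_vec)) / (2 * n)


def signed_difference_mean(eps, f_ghost, f_orig):
    """(1/N) sum_n eps_n * (f(z'_n) - f(z_n))."""
    n = len(eps)
    return sum(e * (g - o) for e, g, o in zip(eps, f_ghost, f_orig)) / n


if __name__ == "__main__":
    eps = [1, -1, 1, 1]
    ghost = [0.2, 0.4, 0.9, 0.1]   # illustrative values of f on ghost samples
    orig = [0.5, 0.3, 0.7, 0.6]    # illustrative values of f on original samples
    print(symmetrized_inner_product(eps, ghost, orig))
    print(0.5 * signed_difference_mean(eps, ghost, orig))
```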
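According to (16), (85) and Theorem B.2, for any $\tau\in[0,1)$ and given an arbitrary $\xi>(1-\tau)D_{\mathcal{F}}(S,T)$, we have for any $N_SN_T\ge\frac{8(b-a)^2}{(\xi')^2}$ with $\xi' = \xi-(1-\tau)D_{\mathcal{F}}(S,T)$,

$$\begin{aligned}
&\Pr\Big\{\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(T)}f-\mathrm{E}_\tau f\big| > \xi\Big\}\\
&\le 2\Pr\Big\{\sup_{f\in\mathcal{F}}\big|\mathrm{E}'_\tau f-\mathrm{E}_\tau f\big| > \frac{\xi'}{2}\Big\}\qquad\text{(by Theorem B.2)}\\
&= 2\Pr\Big\{\sup_{f\in\mathcal{F}}\Big|\frac{\tau}{N_T}\sum_{n=1}^{N_T}\big(f(\mathbf{z}'^{(T)}_n)-f(\mathbf{z}^{(T)}_n)\big)+\frac{1-\tau}{N_S}\sum_{n=1}^{N_S}\big(f(\mathbf{z}'^{(S)}_n)-f(\mathbf{z}^{(S)}_n)\big)\Big| > \frac{\xi'}{2}\Big\}\\
&= 2\Pr\Big\{\sup_{f\in\mathcal{F}}\Big|\frac{\tau}{N_T}\sum_{n=1}^{N_T}\epsilon_n\big(f(\mathbf{z}'^{(T)}_n)-f(\mathbf{z}^{(T)}_n)\big)+\frac{1-\tau}{N_S}\sum_{n=1}^{N_S}\epsilon_n\big(f(\mathbf{z}'^{(S)}_n)-f(\mathbf{z}^{(S)}_n)\big)\Big| > \frac{\xi'}{2}\Big\}\\
&= 2\Pr\Big\{\sup_{f\in\mathcal{F}}\Big|\frac{\tau}{2N_T}\big\langle\vec{\epsilon}_T,\vec{f}(\overline{\mathbf{Z}}_1^{2N_T})\big\rangle+\frac{1-\tau}{2N_S}\big\langle\vec{\epsilon}_S,\vec{f}(\mathbf{Z}_1^{2N_S})\big\rangle\Big| > \frac{\xi'}{4}\Big\}.
\end{aligned}\tag{93}$$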



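Given a $\tau\in[0,1)$, fix a realization of $\mathbf{Z}$ and let $\Lambda$ be a $\xi'/8$-radius cover of $\mathcal{F}$ with respect to the
$\ell_1^{\tau}(\mathbf{Z})$ norm. Since $\mathcal{F}$ is composed of bounded functions with the range $[a,b]$, we assume that the
same holds for any $h\in\Lambda$. If $f_0$ is the function that achieves the following supremum

$$\sup_{f\in\mathcal{F}}\Big|\frac{\tau}{2N_T}\big\langle\vec{\epsilon}_T,\vec{f}(\overline{\mathbf{Z}}_1^{2N_T})\big\rangle+\frac{1-\tau}{2N_S}\big\langle\vec{\epsilon}_S,\vec{f}(\mathbf{Z}_1^{2N_S})\big\rangle\Big| > \frac{\xi'}{4},$$

there must be an $h_0\in\Lambda$ that satisfies

$$\begin{aligned}
&\frac{\tau}{2N_T}\sum_{n=1}^{N_T}\Big(\big|f_0(\mathbf{z}'^{(T)}_n)-h_0(\mathbf{z}'^{(T)}_n)\big|+\big|f_0(\mathbf{z}^{(T)}_n)-h_0(\mathbf{z}^{(T)}_n)\big|\Big)\\
&\quad+\frac{1-\tau}{2N_S}\sum_{n=1}^{N_S}\Big(\big|f_0(\mathbf{z}'^{(S)}_n)-h_0(\mathbf{z}'^{(S)}_n)\big|+\big|f_0(\mathbf{z}^{(S)}_n)-h_0(\mathbf{z}^{(S)}_n)\big|\Big) < \frac{\xi'}{8},
\end{aligned}$$

and meanwhile,

$$\Big|\frac{\tau}{2N_T}\big\langle\vec{\epsilon}_T,\vec{h}_0(\overline{\mathbf{Z}}_1^{2N_T})\big\rangle+\frac{1-\tau}{2N_S}\big\langle\vec{\epsilon}_S,\vec{h}_0(\mathbf{Z}_1^{2N_S})\big\rangle\Big| > \frac{\xi'}{8}.$$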

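Therefore, for the realization of $\mathbf{Z}$, we arrive at

$$\begin{aligned}
&\Pr\Big\{\sup_{f\in\mathcal{F}}\Big|\frac{\tau}{2N_T}\big\langle\vec{\epsilon}_T,\vec{f}(\overline{\mathbf{Z}}_1^{2N_T})\big\rangle+\frac{1-\tau}{2N_S}\big\langle\vec{\epsilon}_S,\vec{f}(\mathbf{Z}_1^{2N_S})\big\rangle\Big| > \frac{\xi'}{4}\Big\}\\
&\le \Pr\Big\{\sup_{h\in\Lambda}\Big|\frac{\tau}{2N_T}\big\langle\vec{\epsilon}_T,\vec{h}(\overline{\mathbf{Z}}_1^{2N_T})\big\rangle+\frac{1-\tau}{2N_S}\big\langle\vec{\epsilon}_S,\vec{h}(\mathbf{Z}_1^{2N_S})\big\rangle\Big| > \frac{\xi'}{8}\Big\}.
\end{aligned}\tag{94}$$

Moreover, we denote the event

$$A := \Big\{\sup_{h\in\Lambda}\Big|\frac{\tau}{2N_T}\big\langle\vec{\epsilon}_T,\vec{h}(\overline{\mathbf{Z}}_1^{2N_T})\big\rangle+\frac{1-\tau}{2N_S}\big\langle\vec{\epsilon}_S,\vec{h}(\mathbf{Z}_1^{2N_S})\big\rangle\Big| > \frac{\xi'}{8}\Big\},$$

and let $\mathbf{1}_A$ be the characteristic function of the event $A$. By Fubini's Theorem, we have

$$\begin{aligned}
\Pr\{A\} &= \mathrm{E}\big\{\mathrm{E}_{\epsilon}\{\mathbf{1}_A\}\,\big|\,\mathbf{Z}\big\}\\
&= \mathrm{E}\Big\{\Pr\Big\{\sup_{h\in\Lambda}\Big|\frac{\tau}{2N_T}\big\langle\vec{\epsilon}_T,\vec{h}(\overline{\mathbf{Z}}_1^{2N_T})\big\rangle+\frac{1-\tau}{2N_S}\big\langle\vec{\epsilon}_S,\vec{h}(\mathbf{Z}_1^{2N_S})\big\rangle\Big| > \frac{\xi'}{8}\Big\}\,\Big|\,\mathbf{Z}\Big\}.
\end{aligned}\tag{95}$$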



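Fix a realization of $\mathbf{Z}$ again. According to (21), (90), (91) and Theorem B.1, for any $\tau\in[0,1)$
and given an arbitrary $\xi'>0$, we have for any $N_SN_T\ge\frac{8(b-a)^2}{(\xi')^2}$,

$$\begin{aligned}
&\Pr\Big\{\sup_{h\in\Lambda}\Big|\frac{\tau}{2N_T}\big\langle\vec{\epsilon}_T,\vec{h}(\overline{\mathbf{Z}}_1^{2N_T})\big\rangle+\frac{1-\tau}{2N_S}\big\langle\vec{\epsilon}_S,\vec{h}(\mathbf{Z}_1^{2N_S})\big\rangle\Big| > \frac{\xi'}{8}\Big\}\\
&\le |\Lambda|\max_{h\in\Lambda}\Pr\Big\{\Big|\frac{\tau}{2N_T}\big\langle\vec{\epsilon}_T,\vec{h}(\overline{\mathbf{Z}}_1^{2N_T})\big\rangle+\frac{1-\tau}{2N_S}\big\langle\vec{\epsilon}_S,\vec{h}(\mathbf{Z}_1^{2N_S})\big\rangle\Big| > \frac{\xi'}{8}\Big\}\\
&= \mathcal{N}\big(\mathcal{F},\xi'/8,\ell_1^{\tau}(\mathbf{Z})\big)\max_{h\in\Lambda}\Pr\Big\{\big|\mathrm{E}'_\tau h-\mathrm{E}_\tau h\big| > \frac{\xi'}{4}\Big\}\\
&\le \mathcal{N}\big(\mathcal{F},\xi'/8,\ell_1^{\tau}(\mathbf{Z})\big)\max_{h\in\Lambda}\Pr\Big\{\big|\widetilde{\mathrm{E}}h-\mathrm{E}'_\tau h\big|+\big|\widetilde{\mathrm{E}}h-\mathrm{E}_\tau h\big| > \frac{\xi'}{4}\Big\}\\
&\le 2\,\mathcal{N}\big(\mathcal{F},\xi'/8,\ell_1^{\tau}(\mathbf{Z})\big)\max_{h\in\Lambda}\Pr\Big\{\big|\widetilde{\mathrm{E}}h-\mathrm{E}_\tau h\big| > \frac{\xi'}{8}\Big\}\\
&\le 4\,\mathcal{N}\big(\mathcal{F},\xi'/8,\ell_1^{\tau}(\mathbf{Z})\big)\exp\left\{-\frac{N_SN_T\big(\xi-(1-\tau)D_{\mathcal{F}}(S,T)\big)^2}{32(b-a)^2\big((1-\tau)^2N_T+\tau^2N_S\big)}\right\},
\end{aligned}\tag{96}$$

where $\widetilde{\mathrm{E}}h := \tau\mathrm{E}^{(T)}h+(1-\tau)\mathrm{E}^{(S)}h$.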

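The combination of (22), (93), (94) and (96) leads to the following result: for any $\tau\in[0,1)$
and given an arbitrary $\xi>(1-\tau)D_{\mathcal{F}}(S,T)$, we have for any $N_SN_T\ge\frac{8(b-a)^2}{(\xi')^2}$,

$$\begin{aligned}
\Pr\Big\{\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(T)}f-\mathrm{E}_\tau f\big| > \xi\Big\}
&\le 8\,\mathrm{E}\,\mathcal{N}\big(\mathcal{F},\xi'/8,\ell_1^{\tau}(\mathbf{Z})\big)\exp\left\{-\frac{N_SN_T\big(\xi-(1-\tau)D_{\mathcal{F}}(S,T)\big)^2}{32(b-a)^2\big((1-\tau)^2N_T+\tau^2N_S\big)}\right\}\\
&\le 8\,\mathcal{N}_1^{\tau}\big(\mathcal{F},\xi'/8,2(N_S+N_T)\big)\exp\left\{-\frac{N_SN_T\big(\xi-(1-\tau)D_{\mathcal{F}}(S,T)\big)^2}{32(b-a)^2\big((1-\tau)^2N_T+\tau^2N_S\big)}\right\}.
\end{aligned}\tag{97}$$

According to (97), letting

$$\epsilon := 8\,\mathcal{N}_1^{\tau}\big(\mathcal{F},\xi'/8,2(N_S+N_T)\big)\exp\left\{-\frac{N_SN_T\big(\xi-(1-\tau)D_{\mathcal{F}}(S,T)\big)^2}{32(b-a)^2\big((1-\tau)^2N_T+\tau^2N_S\big)}\right\},$$

we have given an arbitrary $\xi>(1-\tau)D_{\mathcal{F}}(S,T)$ and for any $N_SN_T\ge\frac{8(b-a)^2}{(\xi')^2}$, with probability
at least $1-\epsilon$,

$$\sup_{f\in\mathcal{F}}\big|\mathrm{E}_\tau f-\mathrm{E}^{(T)}f\big| \le (1-\tau)D_{\mathcal{F}}(S,T)+\left(\frac{\ln\mathcal{N}_1^{\tau}\big(\mathcal{F},\xi'/8,2(N_S+N_T)\big)-\ln(\epsilon/8)}{\dfrac{N_SN_T}{32(b-a)^2\big((1-\tau)^2N_T+\tau^2N_S\big)}}\right)^{\frac{1}{2}}.$$

This completes the proof.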
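To see how the final bound depends on $N_S$, $N_T$ and $\tau$, the sketch below evaluates its right-hand side for an assumed log covering number. It is an illustration with hypothetical names, not the authors' code. The grid search only shows where the exponent factor is largest (analytically at $\tau=N_T/(N_S+N_T)$, by minimizing $(1-\tau)^2N_T+\tau^2N_S$); it ignores the discrepancy term $(1-\tau)D_{\mathcal{F}}(S,T)$ and therefore does not identify the $\tau$ that minimizes the whole bound.

```python
import math


def combined_rate(n_s, n_t, tau, a=0.0, b=1.0):
    """N_S N_T / (32 (b-a)^2 ((1-tau)^2 N_T + tau^2 N_S)), the factor in the exponent."""
    return n_s * n_t / (32.0 * (b - a) ** 2 * ((1 - tau) ** 2 * n_t + tau ** 2 * n_s))


def bound_rhs(discrepancy, log_cover, eps, n_s, n_t, tau, a=0.0, b=1.0):
    """(1-tau) D + sqrt((ln N - ln(eps/8)) / rate) for an assumed log covering number."""
    rate = combined_rate(n_s, n_t, tau, a, b)
    return (1 - tau) * discrepancy + math.sqrt((log_cover - math.log(eps / 8.0)) / rate)


def best_tau_on_grid(n_s, n_t, steps=1000):
    """Grid search over tau in [0, 1) for the largest exponent factor."""
    grid = [i / steps for i in range(steps)]
    return max(grid, key=lambda t: combined_rate(n_s, n_t, t))


if __name__ == "__main__":
    print(combined_rate(64, 10, 0.0))
    print(best_tau_on_grid(300, 100))
    print(bound_rhs(0.1, 2.0, 0.05, 3000, 1000, 0.25))
```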
C Proofs of Theorems 5.2 & 6.2

In this appendix, we will prove Theorem 5.2 and Theorem 6.2. In order to achieve the proofs,
we need to generalize the classical McDiarmid's inequality (see Bousquet et al., 2004, Theorem
6) to a more general setting where independent random variables can independently take values
from different domains.

C.1 Generalized McDiarmid's Inequality

The following is the classical McDiarmid's inequality, which is one of the most frequently used
deviation inequalities in statistical learning theory and has been widely used to obtain
generalization bounds based on the Rademacher complexity under the assumption of same distribution
(see Bousquet et al., 2004, Theorem 6).



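Theorem C.1 (McDiarmid's Inequality) Let $\mathbf{z}_1,\dots,\mathbf{z}_N$ be $N$ independent random variables
taking values from the domain $\mathcal{Z}$. Assume that the function $H:\mathcal{Z}^N\to\mathbb{R}$ satisfies the condition of
bounded difference: for all $1\le n\le N$,

$$\sup_{\mathbf{z}_1,\dots,\mathbf{z}_N,\mathbf{z}'_n}\big|H(\mathbf{z}_1,\dots,\mathbf{z}_n,\dots,\mathbf{z}_N)-H(\mathbf{z}_1,\dots,\mathbf{z}'_n,\dots,\mathbf{z}_N)\big| \le c_n. \tag{98}$$

Then, for any $\xi>0$,

$$\Pr\big\{H(\mathbf{z}_1,\dots,\mathbf{z}_n,\dots,\mathbf{z}_N)-\mathrm{E}\{H(\mathbf{z}_1,\dots,\mathbf{z}_n,\dots,\mathbf{z}_N)\} \ge \xi\big\} \le \exp\left\{-\frac{2\xi^2}{\sum_{n=1}^{N}c_n^2}\right\}.$$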

As shown in Theorem C.1, the classical McDiarmid's inequality is valid under the condition
that the random variables $\mathbf{z}_1,\dots,\mathbf{z}_N$ are independent and drawn from the same domain. Next, we
generalize this inequality to a more general setting, where the independent random variables can
take values from different domains.

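Theorem C.2 Given independent domains $\mathcal{Z}^{(S_k)}$ $(1\le k\le K)$, for any $1\le k\le K$, let
$\mathbf{Z}_1^{N_k} := \{\mathbf{z}_n^{(k)}\}_{n=1}^{N_k}$ be $N_k$ independent random variables taking values from the domain $\mathcal{Z}^{(S_k)}$.
Assume that the function $H:(\mathcal{Z}^{(S_1)})^{N_1}\times\cdots\times(\mathcal{Z}^{(S_K)})^{N_K}\to\mathbb{R}$ satisfies the condition of bounded
difference: for all $1\le k\le K$ and $1\le n\le N_k$,

$$\begin{aligned}
\sup_{\mathbf{Z}_1^{N_1},\dots,\mathbf{Z}_1^{N_K},\,\mathbf{z}'^{(k)}_n}\Big|&H\big(\mathbf{Z}_1^{N_1},\dots,\mathbf{Z}_1^{N_{k-1}},\mathbf{z}_1^{(k)},\dots,\mathbf{z}_n^{(k)},\dots,\mathbf{z}_{N_k}^{(k)},\mathbf{Z}_1^{N_{k+1}},\dots,\mathbf{Z}_1^{N_K}\big)\\
&-H\big(\mathbf{Z}_1^{N_1},\dots,\mathbf{Z}_1^{N_{k-1}},\mathbf{z}_1^{(k)},\dots,\mathbf{z}'^{(k)}_n,\dots,\mathbf{z}_{N_k}^{(k)},\mathbf{Z}_1^{N_{k+1}},\dots,\mathbf{Z}_1^{N_K}\big)\Big| \le c_n^{(k)}.
\end{aligned}\tag{99}$$

Then, for any $\xi>0$,

$$\Pr\big\{H(\mathbf{Z}_1^{N_1},\dots,\mathbf{Z}_1^{N_K})-\mathrm{E}\{H(\mathbf{Z}_1^{N_1},\dots,\mathbf{Z}_1^{N_K})\} \ge \xi\big\} \le \exp\left\{-\frac{2\xi^2}{\sum_{k=1}^{K}\sum_{n=1}^{N_k}\big(c_n^{(k)}\big)^2}\right\}.$$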



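Proof. Define a random variable

$$T_n^{(k)} := \mathrm{E}\big\{H(\{\mathbf{Z}_1^{N_k}\}_{k=1}^{K})\,\big|\,\mathbf{Z}_1^{N_1},\mathbf{Z}_1^{N_2},\dots,\mathbf{Z}_1^{N_{k-1}},\mathbf{Z}_1^{n}\big\},\quad 1\le k\le K,\ 0\le n\le N_k, \tag{100}$$

where

$$\mathbf{Z}_1^{n} = \{\mathbf{z}_1^{(k)},\dots,\mathbf{z}_n^{(k)}\}\subseteq\mathbf{Z}_1^{N_k},\ \text{ and } \mathbf{Z}_1^{0} = \varnothing.$$

It is clear that

$$T_0^{(1)} = \mathrm{E}\big\{H(\{\mathbf{Z}_1^{N_k}\}_{k=1}^{K})\big\}\ \text{ and }\ T_{N_K}^{(K)} = H(\{\mathbf{Z}_1^{N_k}\}_{k=1}^{K}),$$

and thus

$$H(\{\mathbf{Z}_1^{N_k}\}_{k=1}^{K})-\mathrm{E}\big\{H(\{\mathbf{Z}_1^{N_k}\}_{k=1}^{K})\big\} = T_{N_K}^{(K)}-T_0^{(1)} = \sum_{k=1}^{K}\sum_{n=1}^{N_k}\big(T_n^{(k)}-T_{n-1}^{(k)}\big). \tag{101}$$

Denote for any $1\le k\le K$ and $1\le n\le N_k$,

$$\begin{aligned}
U_n^{(k)} &= \sup_{\mu}\Big\{T_n^{(k)}\big|_{\mathbf{z}_n^{(k)}=\mu}-T_{n-1}^{(k)}\Big\};\\
L_n^{(k)} &= \inf_{\nu}\Big\{T_n^{(k)}\big|_{\mathbf{z}_n^{(k)}=\nu}-T_{n-1}^{(k)}\Big\}.
\end{aligned}$$

It follows from the definition (100) that $L_n^{(k)}\le\big(T_n^{(k)}-T_{n-1}^{(k)}\big)\le U_n^{(k)}$ and thus results in

$$T_n^{(k)}-T_{n-1}^{(k)} \le U_n^{(k)}-L_n^{(k)} = \sup_{\mu,\nu}\Big\{T_n^{(k)}\big|_{\mathbf{z}_n^{(k)}=\mu}-T_n^{(k)}\big|_{\mathbf{z}_n^{(k)}=\nu}\Big\} \le c_n^{(k)}. \tag{102}$$

Moreover, by the law of iterated expectation, we also have for any $1\le k\le K$ and $1\le n\le N_k$,

$$\mathrm{E}\big\{T_n^{(k)}-T_{n-1}^{(k)}\,\big|\,\mathbf{Z}_1^{N_1},\mathbf{Z}_1^{N_2},\dots,\mathbf{Z}_1^{N_{k-1}},\mathbf{Z}_1^{n-1}\big\} = 0. \tag{103}$$

According to Hoeffding's inequality (see Hoeffding, 1963), given an $\alpha>0$, the condition (99) leads
to for any $1\le k\le K$ and $1\le n\le N_k$,

$$\mathrm{E}\Big\{e^{\alpha\left(T_n^{(k)}-T_{n-1}^{(k)}\right)}\,\Big|\,\mathbf{Z}_1^{N_1},\mathbf{Z}_1^{N_2},\dots,\mathbf{Z}_1^{N_{k-1}},\mathbf{Z}_1^{n-1}\Big\} \le e^{\alpha^2\left(c_n^{(k)}\right)^2/8}. \tag{104}$$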
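Subsequently, according to Markov's inequality, (101), (102), (103) and (104), we have for
any $\alpha>0$,

$$\begin{aligned}
&\Pr\big\{H(\mathbf{Z}_1^{N_1},\dots,\mathbf{Z}_1^{N_K})-\mathrm{E}\{H(\mathbf{Z}_1^{N_1},\dots,\mathbf{Z}_1^{N_K})\} \ge \xi\big\}\\
&\le e^{-\alpha\xi}\,\mathrm{E}\Big\{e^{\alpha\left(H(\mathbf{Z}_1^{N_1},\dots,\mathbf{Z}_1^{N_K})-\mathrm{E}\{H(\mathbf{Z}_1^{N_1},\dots,\mathbf{Z}_1^{N_K})\}\right)}\Big\}\\
&= e^{-\alpha\xi}\,\mathrm{E}\Big\{e^{\alpha\sum_{k=1}^{K}\sum_{n=1}^{N_k}\left(T_n^{(k)}-T_{n-1}^{(k)}\right)}\Big\}\\
&= e^{-\alpha\xi}\,\mathrm{E}\Big\{\mathrm{E}\Big\{e^{\alpha\sum_{k=1}^{K}\sum_{n=1}^{N_k}\left(T_n^{(k)}-T_{n-1}^{(k)}\right)}\,\Big|\,\mathbf{Z}_1^{N_1},\dots,\mathbf{Z}_1^{N_{K-1}},\mathbf{Z}_1^{N_K-1}\Big\}\Big\}\\
&= e^{-\alpha\xi}\,\mathrm{E}\Big\{e^{\alpha\left(\sum_{k=1}^{K-1}\sum_{n=1}^{N_k}\left(T_n^{(k)}-T_{n-1}^{(k)}\right)+\sum_{n=1}^{N_K-1}\left(T_n^{(K)}-T_{n-1}^{(K)}\right)\right)}\,\mathrm{E}\Big\{e^{\alpha\left(T_{N_K}^{(K)}-T_{N_K-1}^{(K)}\right)}\,\Big|\,\mathbf{Z}_1^{N_1},\dots,\mathbf{Z}_1^{N_{K-1}},\mathbf{Z}_1^{N_K-1}\Big\}\Big\}\\
&\le e^{-\alpha\xi}\,\mathrm{E}\Big\{e^{\alpha\left(\sum_{k=1}^{K-1}\sum_{n=1}^{N_k}\left(T_n^{(k)}-T_{n-1}^{(k)}\right)+\sum_{n=1}^{N_K-1}\left(T_n^{(K)}-T_{n-1}^{(K)}\right)\right)}\Big\}\,e^{\alpha^2\left(c_{N_K}^{(K)}\right)^2/8}\\
&\le e^{-\alpha\xi}\prod_{k=1}^{K}\prod_{n=1}^{N_k}\exp\left\{\frac{\alpha^2\big(c_n^{(k)}\big)^2}{8}\right\}
= \exp\left\{-\alpha\xi+\alpha^2\sum_{k=1}^{K}\sum_{n=1}^{N_k}\frac{\big(c_n^{(k)}\big)^2}{8}\right\}.
\end{aligned}$$

The above bound is minimized by setting

$$\alpha^{*} = \frac{4\xi}{\sum_{k=1}^{K}\sum_{n=1}^{N_k}\big(c_n^{(k)}\big)^2},$$

and its minimum value is

$$\exp\left\{-\frac{2\xi^2}{\sum_{k=1}^{K}\sum_{n=1}^{N_k}\big(c_n^{(k)}\big)^2}\right\}.$$

This completes the proof.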

C.2 Proof of Theorem 5.2

By using Theorem C.2, we prove Theorem 5.2 as follows:

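Proof of Theorem 5.2. Assume that $\mathcal{F}$ is a function class consisting of bounded functions
with the range $[a,b]$. Let sample sets $\{\mathbf{Z}_1^{N_k}\}_{k=1}^{K} := \{\{\mathbf{z}_n^{(k)}\}_{n=1}^{N_k}\}_{k=1}^{K}$ be drawn from multiple sources
$\mathcal{Z}^{(S_k)}$ $(1\le k\le K)$, respectively. Given a choice of $\mathbf{w}\in[0,1]^K$ with $\sum_{k=1}^{K}w_k = 1$, denote

$$H(\mathbf{Z}_1^{N_1},\dots,\mathbf{Z}_1^{N_K}) := \sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(S)}_{\mathbf{w}}f-\mathrm{E}^{(T)}f\big|. \tag{105}$$

By the definition of $\mathrm{E}^{(S)}_{\mathbf{w}}f$, we have

$$H(\mathbf{Z}_1^{N_1},\dots,\mathbf{Z}_1^{N_K}) = \sup_{f\in\mathcal{F}}\Big|\sum_{k=1}^{K}w_k\mathrm{E}^{(S_k)}_{N_k}f-\mathrm{E}^{(T)}f\Big|, \tag{106}$$

where $\mathrm{E}^{(S_k)}_{N_k}f = \frac{1}{N_k}\sum_{n=1}^{N_k}f(\mathbf{z}_n^{(k)})$. Therefore, it is clear that such $H(\mathbf{Z}_1^{N_1},\dots,\mathbf{Z}_1^{N_K})$ satisfies the
condition of bounded difference with

$$c_n^{(k)} = \frac{(b-a)w_k}{N_k}$$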
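for all $1\le k\le K$ and $1\le n\le N_k$. Thus, according to Theorem C.2, we have for any $\xi>0$,

$$\Pr\big\{H(\mathbf{Z}_1^{N_1},\dots,\mathbf{Z}_1^{N_K})-\mathrm{E}^{(S)}\{H(\mathbf{Z}_1^{N_1},\dots,\mathbf{Z}_1^{N_K})\} \ge \xi\big\} \le \exp\left\{-\frac{2\xi^2}{\sum_{k=1}^{K}\frac{(b-a)^2w_k^2}{N_k}}\right\},$$

which can be equivalently rewritten as with probability at least $1-\epsilon$,

$$\begin{aligned}
&H(\mathbf{Z}_1^{N_1},\dots,\mathbf{Z}_1^{N_K})\\
&\le \mathrm{E}^{(S)}\{H(\mathbf{Z}_1^{N_1},\dots,\mathbf{Z}_1^{N_K})\}+\sqrt{\sum_{k=1}^{K}\frac{(b-a)^2w_k^2\ln(1/\epsilon)}{2N_k}}\\
&= \mathrm{E}^{(S)}\Big\{\sup_{f\in\mathcal{F}}\Big|\sum_{k=1}^{K}w_k\big(\mathrm{E}^{(S_k)}_{N_k}f-\mathrm{E}^{(T)}f\big)\Big|\Big\}+\sqrt{\sum_{k=1}^{K}\frac{(b-a)^2w_k^2\ln(1/\epsilon)}{2N_k}}\\
&= \mathrm{E}^{(S)}\Big\{\sup_{f\in\mathcal{F}}\Big|\sum_{k=1}^{K}w_k\big(\mathrm{E}^{(S_k)}_{N_k}f-\mathrm{E}^{(S_k)}f+\mathrm{E}^{(S_k)}f-\mathrm{E}^{(T)}f\big)\Big|\Big\}+\sqrt{\sum_{k=1}^{K}\frac{(b-a)^2w_k^2\ln(1/\epsilon)}{2N_k}}\\
&\le \mathrm{E}^{(S)}\Big\{\sup_{f\in\mathcal{F}}\Big|\sum_{k=1}^{K}w_k\big(\mathrm{E}^{(S_k)}_{N_k}f-\mathrm{E}^{(S_k)}f\big)\Big|\Big\}+\sum_{k=1}^{K}w_k\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(S_k)}f-\mathrm{E}^{(T)}f\big|+\sqrt{\sum_{k=1}^{K}\frac{(b-a)^2w_k^2\ln(1/\epsilon)}{2N_k}}\\
&= \mathrm{E}^{(S)}\Big\{\sup_{f\in\mathcal{F}}\Big|\sum_{k=1}^{K}w_k\big(\mathrm{E}^{(S_k)}_{N_k}f-\mathrm{E}^{(S_k)}f\big)\Big|\Big\}+D^{(\mathbf{w})}_{\mathcal{F}}(S,T)+\sqrt{\sum_{k=1}^{K}\frac{(b-a)^2w_k^2\ln(1/\epsilon)}{2N_k}},
\end{aligned}\tag{107}$$

where the last step uses the definition of $D^{(\mathbf{w})}_{\mathcal{F}}(S,T)$.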
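Next, according to (23) and (106), we have

$$\begin{aligned}
&\mathrm{E}^{(S)}\sup_{f\in\mathcal{F}}\Big|\sum_{k=1}^{K}w_k\big(\mathrm{E}^{(S_k)}_{N_k}f-\mathrm{E}^{(S_k)}f\big)\Big|\\
&= \mathrm{E}^{(S)}\sup_{f\in\mathcal{F}}\Big|\sum_{k=1}^{K}w_k\big(\mathrm{E}^{(S_k)}_{N_k}f-\mathrm{E}'^{(S_k)}\{\mathrm{E}'^{(S_k)}_{N_k}f\}\big)\Big|\\
&\le \mathrm{E}^{(S)}\mathrm{E}'^{(S)}\sup_{f\in\mathcal{F}}\Big|\sum_{k=1}^{K}w_k\big(\mathrm{E}^{(S_k)}_{N_k}f-\mathrm{E}'^{(S_k)}_{N_k}f\big)\Big|\\
&= \mathrm{E}^{(S)}\mathrm{E}'^{(S)}\sup_{f\in\mathcal{F}}\Big|\sum_{k=1}^{K}w_k\frac{1}{N_k}\sum_{n=1}^{N_k}\big(f(\mathbf{z}_n^{(k)})-f(\mathbf{z}'^{(k)}_n)\big)\Big|\\
&\le \mathrm{E}^{(S)}\mathrm{E}'^{(S)}\mathrm{E}_{\sigma}\sup_{f\in\mathcal{F}}\Big|\sum_{k=1}^{K}w_k\frac{1}{N_k}\sum_{n=1}^{N_k}\sigma_n^{(k)}\big(f(\mathbf{z}_n^{(k)})-f(\mathbf{z}'^{(k)}_n)\big)\Big|\\
&\le 2\,\mathrm{E}^{(S)}\mathrm{E}_{\sigma}\sup_{f\in\mathcal{F}}\Big|\sum_{k=1}^{K}w_k\frac{1}{N_k}\sum_{n=1}^{N_k}\sigma_n^{(k)}f(\mathbf{z}_n^{(k)})\Big|\\
&\le 2\sum_{k=1}^{K}w_k\,\mathrm{E}^{(S_k)}\mathrm{E}_{\sigma}\sup_{f\in\mathcal{F}}\Big|\frac{1}{N_k}\sum_{n=1}^{N_k}\sigma_n^{(k)}f(\mathbf{z}_n^{(k)})\Big|\\
&= 2\sum_{k=1}^{K}w_k\,\mathcal{R}^{(k)}(\mathcal{F}).
\end{aligned}\tag{108}$$

By combining (105), (107) and (108), we obtain with probability at least $1-\epsilon$,

$$\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(S)}_{\mathbf{w}}f-\mathrm{E}^{(T)}f\big| \le D^{(\mathbf{w})}_{\mathcal{F}}(S,T)+2\sum_{k=1}^{K}w_k\,\mathcal{R}^{(k)}(\mathcal{F})+\sqrt{\sum_{k=1}^{K}\frac{(b-a)^2w_k^2\ln(1/\epsilon)}{2N_k}}.$$

This completes the proof.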
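The quantities in the bound just derived can be estimated on data. The sketch below is a minimal illustration with hypothetical names, not the authors' experimental code: `empirical_rademacher` is a Monte Carlo estimate of $\mathrm{E}_{\sigma}\sup_{f}\big|\frac{1}{N}\sum_n\sigma_nf(\mathbf{z}_n)\big|$ on one fixed sample for a finite class of threshold functions (an assumed toy class), and `mcdiarmid_term` evaluates the square-root term.

```python
import math
import random


def threshold(t):
    """Toy function class member: z -> 1 if z > t else 0 (values in [0, 1])."""
    return lambda z: 1.0 if z > t else 0.0


def empirical_rademacher(sample, functions, draws=200, seed=0):
    """Monte Carlo estimate of E_sigma sup_f |(1/N) sum_n sigma_n f(z_n)| on a fixed sample."""
    rng = random.Random(seed)
    n = len(sample)
    values = [[f(z) for z in sample] for f in functions]
    total = 0.0
    for _ in range(draws):
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(abs(sum(s * v for s, v in zip(sigma, row))) / n for row in values)
    return total / draws


def mcdiarmid_term(weights, sizes, eps, a=0.0, b=1.0):
    """sqrt(sum_k (b-a)^2 w_k^2 ln(1/eps) / (2 N_k))."""
    return math.sqrt(sum((b - a) ** 2 * w * w * math.log(1.0 / eps) / (2.0 * n)
                         for w, n in zip(weights, sizes)))


if __name__ == "__main__":
    sample = [i / 10 for i in range(10)]
    small = [threshold(0.5)]
    large = small + [threshold(0.2), threshold(0.8)]
    print(empirical_rademacher(sample, small), empirical_rademacher(sample, large))
    print(mcdiarmid_term([0.5, 0.5], [100, 400], 0.05))
```

With a fixed seed the same sign vectors are drawn for both classes, so the estimate for the larger class can never fall below that of the smaller one, mirroring the monotonicity of the complexity in the class.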



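C.3 Proof of Theorem 6.2

In order to prove Theorem 6.2, we also need the following result (see Bousquet et al., 2004,
Theorem 5):

Theorem C.3 Let $\mathcal{F}\subseteq[a,b]^{\mathcal{Z}}$. For any $\epsilon>0$, with probability at least $1-\epsilon$, there holds that
for any $f\in\mathcal{F}$,

$$\begin{aligned}
\mathrm{E}f &\le \mathrm{E}_Nf+2\mathcal{R}(\mathcal{F})+\sqrt{\frac{(b-a)\ln(1/\epsilon)}{2N}}\\
&\le \mathrm{E}_Nf+2\mathcal{R}_N(\mathcal{F})+3\sqrt{\frac{(b-a)\ln(2/\epsilon)}{2N}}.
\end{aligned}\tag{109}$$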
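Again, we prove Theorem 6.2 by using Theorems C.2 and C.3.

Proof of Theorem 6.2. We only consider the result of Theorem C.2 in a special setting of
$K = 2$, $N_1 = N_T$, $N_2 = N_S$, $w_1 = \tau$ and $w_2 = 1-\tau$. Given a choice of $\tau\in[0,1)$, denote

$$H(\mathbf{Z}_1^{N_S},\overline{\mathbf{Z}}_1^{N_T}) := \sup_{f\in\mathcal{F}}\big|\mathrm{E}_\tau f-\mathrm{E}^{(T)}f\big|. \tag{110}$$

By (9), we have

$$H(\mathbf{Z}_1^{N_S},\overline{\mathbf{Z}}_1^{N_T}) = \sup_{f\in\mathcal{F}}\big|\tau\mathrm{E}^{(T)}_{N_T}f+(1-\tau)\mathrm{E}^{(S)}_{N_S}f-\mathrm{E}^{(T)}f\big|. \tag{111}$$

According to Theorem C.2, we have for any $\xi>0$,

$$\Pr\big\{H(\mathbf{Z}_1^{N_S},\overline{\mathbf{Z}}_1^{N_T})-\mathrm{E}^{(*)}\{H(\mathbf{Z}_1^{N_S},\overline{\mathbf{Z}}_1^{N_T})\} \ge \xi\big\} \le \exp\left\{-\frac{2\xi^2}{(b-a)^2\Big(\frac{\tau^2}{N_T}+\frac{(1-\tau)^2}{N_S}\Big)}\right\},$$

which can be equivalently rewritten as with probability at least $1-(\epsilon/2)$,

$$\begin{aligned}
&H(\mathbf{Z}_1^{N_S},\overline{\mathbf{Z}}_1^{N_T})\\
&\le \mathrm{E}^{(*)}\Big\{\sup_{f\in\mathcal{F}}\big|\tau\mathrm{E}^{(T)}_{N_T}f+(1-\tau)\mathrm{E}^{(S)}_{N_S}f-\mathrm{E}^{(T)}f\big|\Big\}+\sqrt{\frac{(b-a)^2\ln(2/\epsilon)}{2}\Big(\frac{\tau^2}{N_T}+\frac{(1-\tau)^2}{N_S}\Big)}\\
&= \mathrm{E}^{(*)}\Big\{\sup_{f\in\mathcal{F}}\big|\tau(\mathrm{E}^{(T)}_{N_T}f-\mathrm{E}^{(T)}f)+(1-\tau)(\mathrm{E}^{(S)}_{N_S}f-\mathrm{E}^{(T)}f)\big|\Big\}+\sqrt{\frac{(b-a)^2\ln(2/\epsilon)}{2}\Big(\frac{\tau^2}{N_T}+\frac{(1-\tau)^2}{N_S}\Big)}\\
&\le \tau\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(T)}_{N_T}f-\mathrm{E}^{(T)}f\big|+(1-\tau)\mathrm{E}^{(S)}\Big\{\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(S)}_{N_S}f-\mathrm{E}^{(T)}f\big|\Big\}+\sqrt{\frac{(b-a)^2\ln(2/\epsilon)}{2}\Big(\frac{\tau^2}{N_T}+\frac{(1-\tau)^2}{N_S}\Big)}\\
&= \tau\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(T)}_{N_T}f-\mathrm{E}^{(T)}f\big|+(1-\tau)\mathrm{E}^{(S)}\Big\{\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(S)}_{N_S}f-\mathrm{E}^{(S)}f+\mathrm{E}^{(S)}f-\mathrm{E}^{(T)}f\big|\Big\}+\sqrt{\frac{(b-a)^2\ln(2/\epsilon)}{2}\Big(\frac{\tau^2}{N_T}+\frac{(1-\tau)^2}{N_S}\Big)}\\
&\le \tau\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(T)}_{N_T}f-\mathrm{E}^{(T)}f\big|+(1-\tau)\mathrm{E}^{(S)}\Big\{\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(S)}_{N_S}f-\mathrm{E}^{(S)}f\big|\Big\}+(1-\tau)\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(S)}f-\mathrm{E}^{(T)}f\big|\\
&\quad+\sqrt{\frac{(b-a)^2\ln(2/\epsilon)}{2}\Big(\frac{\tau^2}{N_T}+\frac{(1-\tau)^2}{N_S}\Big)}\\
&= \tau\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(T)}_{N_T}f-\mathrm{E}^{(T)}f\big|+(1-\tau)\mathrm{E}^{(S)}\Big\{\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(S)}_{N_S}f-\mathrm{E}^{(S)}f\big|\Big\}+(1-\tau)D_{\mathcal{F}}(S,T)\\
&\quad+\sqrt{\frac{(b-a)^2\ln(2/\epsilon)}{2}\Big(\frac{\tau^2}{N_T}+\frac{(1-\tau)^2}{N_S}\Big)}.
\end{aligned}\tag{112}$$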
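According to Theorem C.3, for any $\epsilon>0$, we have with probability at least $1-(\epsilon/2)$,

$$\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(T)}_{N_T}f-\mathrm{E}^{(T)}f\big| \le 2\mathcal{R}^{(T)}_{N_T}(\mathcal{F})+3\sqrt{\frac{(b-a)\ln(4/\epsilon)}{2N_T}}. \tag{113}$$

Next, according to (23), we have

$$\begin{aligned}
\mathrm{E}^{(S)}\Big\{\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(S)}_{N_S}f-\mathrm{E}^{(S)}f\big|\Big\}
&= \mathrm{E}^{(S)}\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(S)}_{N_S}f-\mathrm{E}'^{(S)}\{\mathrm{E}'^{(S)}_{N_S}f\}\big|\\
&\le \mathrm{E}^{(S)}\mathrm{E}'^{(S)}\sup_{f\in\mathcal{F}}\big|\mathrm{E}^{(S)}_{N_S}f-\mathrm{E}'^{(S)}_{N_S}f\big|\\
&= \mathrm{E}^{(S)}\mathrm{E}'^{(S)}\sup_{f\in\mathcal{F}}\Big|\frac{1}{N_S}\sum_{n=1}^{N_S}\big(f(\mathbf{z}_n^{(S)})-f(\mathbf{z}'^{(S)}_n)\big)\Big|\\
&\le \mathrm{E}^{(S)}\mathrm{E}'^{(S)}\mathrm{E}_{\sigma}\sup_{f\in\mathcal{F}}\Big|\frac{1}{N_S}\sum_{n=1}^{N_S}\sigma_n\big(f(\mathbf{z}_n^{(S)})-f(\mathbf{z}'^{(S)}_n)\big)\Big|\\
&\le 2\,\mathrm{E}^{(S)}\mathrm{E}_{\sigma}\sup_{f\in\mathcal{F}}\Big|\frac{1}{N_S}\sum_{n=1}^{N_S}\sigma_nf(\mathbf{z}_n^{(S)})\Big|\\
&= 2\mathcal{R}^{(S)}(\mathcal{F}).
\end{aligned}\tag{114}$$

By combining (110), (112), (113) and (114), we obtain with probability at least $1-\epsilon$,

$$\begin{aligned}
\sup_{f\in\mathcal{F}}\big|\mathrm{E}_\tau f-\mathrm{E}^{(T)}f\big| \le{}& (1-\tau)D_{\mathcal{F}}(S,T)+2(1-\tau)\mathcal{R}^{(S)}(\mathcal{F})\\
&+2\tau\mathcal{R}^{(T)}_{N_T}(\mathcal{F})+3\tau\sqrt{\frac{(b-a)\ln(4/\epsilon)}{2N_T}}\\
&+(1-\tau)\sqrt{\frac{(b-a)^2\ln(2/\epsilon)}{2}\Big(\frac{\tau^2}{N_T}+\frac{(1-\tau)^2}{N_S}\Big)}.
\end{aligned}$$

This completes the proof.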
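The proof above specializes Theorem C.2 to $K=2$ with $N_1=N_T$, $N_2=N_S$, $w_1=\tau$, $w_2=1-\tau$, and spends half of the confidence budget on the McDiarmid step. The minimal sketch below, with hypothetical function names, checks numerically that the two-domain square-root term (before any prefactor) coincides with the general $K$-source term evaluated at confidence level $\epsilon/2$.

```python
import math


def general_term(weights, sizes, eps, a=0.0, b=1.0):
    """sqrt(sum_k (b-a)^2 w_k^2 ln(1/eps) / (2 N_k)), the K-source McDiarmid term."""
    return math.sqrt(sum((b - a) ** 2 * w * w * math.log(1.0 / eps) / (2.0 * n)
                         for w, n in zip(weights, sizes)))


def two_domain_term(tau, n_t, n_s, eps, a=0.0, b=1.0):
    """sqrt((b-a)^2 ln(2/eps) / 2 * (tau^2 / N_T + (1-tau)^2 / N_S))."""
    return math.sqrt((b - a) ** 2 * math.log(2.0 / eps) / 2.0
                     * (tau ** 2 / n_t + (1 - tau) ** 2 / n_s))


if __name__ == "__main__":
    tau, n_t, n_s, eps = 0.3, 40, 200, 0.1
    print(two_domain_term(tau, n_t, n_s, eps))
    print(general_term([tau, 1 - tau], [n_t, n_s], eps / 2.0))
```

The remaining $\epsilon/2$ is used for the application of Theorem C.3 to the target sample, and a union bound gives overall confidence $1-\epsilon$.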